I. Executive Summary

In this project we have investigated the architecture and design of manycore processor-to-DRAM
networks using integrated silicon photonics. We have focused primarily on two types of networks: the
on-die core-to-core network, and the core-to-memory-controller network that can connect several
processor sockets into a seamless, flat shared-memory system.

In the context of core-to-core networks, we have explored the constraints on photonic technology
imposed by the implementation of non-blocking networks such as crossbars and Clos networks. We
developed a comprehensive modeling framework for estimating the optical and electrical power
requirements of various physical network topologies. We have shown that, in an example 64-tile system,
a photonic Clos network consumes significantly less optical power, thermal tuning power, and area than
global photonic crossbars over a range of photonic device parameters. The results from our network
simulation framework indicate that, compared to various electrical on-chip networks, photonic Clos
networks can provide more uniform latency and throughput across a range of traffic patterns while
consuming less power. These properties will help simplify parallel programming by allowing the
programmer to ignore network topology during optimization. The first part of the report includes our
publication of these findings, presented at the International Symposium for Networks on Chip in May 2009.

In the context of multi-socket core-to-memory-controller networks, we explored the use of silicon
photonics to build a relatively flat, high-bandwidth memory interconnect. In this work, we present a
scalable and coherent multi-socket design and discuss the tradeoffs facing an architect when
incorporating silicon-photonic technology. This work also points to an important indirect impact of
using an efficient interconnect technology like silicon photonics: the impact on the yield and size of
processor chips. With an efficient photonic interconnect, the motivation to integrate cores into large
processor chips weakens, leaving room for die-size optimization to support yield improvements and ease
of packaging, cooling, and power delivery. Details of this work are provided in the second part of the
technical report, presented at the International Conference on Supercomputing in June 2009.
II. Detailed Technical Information

A.  Design  of  Non-Blocking  Core-to-Core  Photonic  Networks 
Abstract

Future manycore processors will require energy-efficient, high-throughput
on-chip networks. Silicon photonics is a promising new interconnect
technology which offers lower power, higher bandwidth density, and shorter
latencies than electrical interconnects. In this paper we explore using
photonics to implement low-diameter non-blocking crossbar and Clos networks.
We use analytical modeling to show that a 64-tile photonic Clos network
consumes significantly less optical power, thermal tuning power, and area
compared to global photonic crossbars over a range of photonic device
parameters. Compared to various electrical on-chip networks, our simulation
results indicate that a photonic Clos network can provide more uniform
latency and throughput across a range of traffic patterns while consuming
less power. These properties will help simplify parallel programming by
allowing the programmer to ignore network topology during optimization.

1.  Introduction 

Today's graphics, network, embedded, and server processors already contain
many processor cores on one chip, and this number is expected to increase
with future scaling. The on-chip communication network is becoming a
critical component, affecting not only performance and power consumption,
but also programmer productivity. From a software perspective, an ideal
network would have uniformly low latency and uniformly high bandwidth. The
electrical on-chip networks used in today's multicore systems (e.g.,
crossbars [8], meshes [3], and rings [11]) will either be difficult to scale
to higher core counts with reasonable power and area overheads or introduce
significant bandwidth and latency non-uniformities. In this paper we explore
the use of silicon-photonic technology to build on-chip networks that scale
well and provide uniformly low latency and uniformly high bandwidth.

Various photonic materials and integration approaches have been proposed to
enable efficient global on-chip communication, and several network
architectures (e.g., crossbars [7, 15] and meshes [13]) have been developed
bottom-up using fixed device technology parameters as drivers. In this
paper, we take a top-down approach by driving the photonic device
requirements based on the projected network and system needs. This allows
quick design-space exploration at the network level, and provides insight
into which network topologies can best harness the advantages of photonics
at different stages of the technology roadmap.

This paper begins by identifying our target system and briefly reviewing the
electrical on-chip networks which will serve as a baseline for our photonic
network proposals. We then use analytical models to investigate the
tradeoffs between various implementations of global photonic crossbars found
in the literature and our own implementations of photonic Clos networks. We
also use simulations to compare the photonic Clos network to electrical mesh
and Clos networks. Our results show that photonic Clos networks consume
significantly less optical laser power, thermal tuning power, and area as
compared to photonic crossbar networks, and offer better energy efficiency
than electrical networks while providing more uniform performance across
various traffic patterns.

2.  Target  System 

Silicon-photonic technology for on-chip communication is still in its
formative stages, but with recent technology advances we project that
photonics might be viable in the late 2010's. This makes the 22 nm node a
reasonable target process technology for our work. By then it will be
possible to integrate hundreds of cores onto a single die. To simplify
design and verification complexity, these cores and/or memory will most
likely be clustered into tiles which are then replicated across the chip and
interconnected with a well-structured on-chip network. The exact nature of
the tiles and the inter-tile communication paradigm are still active areas
of research. The tiles might be homogeneous, with each tile including both
some number of cores and a slice of the on-chip memory, or the tiles might
be heterogeneous, with a mix of compute and memory tiles. The global on-chip
network might be used to implement shared memory, message passing, or both.
Regardless of their exact configuration, however, all future systems will
require some form of on-chip network which provides low-latency and
high-throughput communication at low energy and small area.

Figure 1: Logical View of 64-Tile Network Topologies - (a) 64x64 distributed
tristate global crossbar, (b) 2D 8x8 mesh, (c) concentrated mesh (cmesh)
with 4x concentration, (d) 8-ary, 3-stage Clos network with eight middle
routers. In all four figures: squares = tiles, dots = routers, triangles =
tristate buffers. In (b) and (c) inter-dot lines = two opposite-direction
channels. In (a) and (d) inter-dot lines = unidirectional channels.

Figure 2: Clos Layout - Router group is three routers. Only a subset of the
channels are shown.

For this paper we assume a target system with 64 square tiles operating at
5 GHz on a 400 mm² chip. Figure 1 illustrates some of the topologies
available for implementing on-chip networks. They range from high-radix,
low-diameter crossbar networks to low-radix, high-diameter mesh networks. We
examine networks sized for low (LTBw), medium (MTBw), and high (HTBw)
bandwidth, which correspond to ideal throughputs of 64, 128, and 256 b/cycle
per tile under uniform random traffic. Although we primarily focus on a
single on-chip network, our exploration approach is also applicable to
future systems with multiple physical networks.

3.  Electrical  On-Chip  Networks 

In this section, we explore the qualitative trade-offs between various
network architectures that use traditional electrical interconnect. This
will provide an electrical baseline for comparison, and also yield insight
into the best way to leverage silicon photonics.

3.1.  Electrical  Technology 

The  performance  and  cost  of  on-chip  networks  depend 
heavily  on  various  technology  parameters.  For  this  work 
we  use  the  22  nm  predictive  technology  models  [16]  and 
interconnect  projections  from  [6]  and  the  ITRS. 

All of our inter-router channels are implemented in semi-global metal layers
with standard repeated wires. For medium-length wires (2-3 mm, or
approximately the width of a tile) the repeater sizing and spacing are
chosen so as to minimize the energy for the target cycle time. Longer wires
are energy optimized as well as pipelined to maintain throughput. The
average energy to transmit a bit transition over a distance of 2.5 mm in
200 ps is roughly 160 fJ, while the fixed link cost due to leakage and
clocking is ≈20 fJ per cycle. The wire pitch is only 500 nm, which means
that ten thousand wires can be supported across the bisection of our target
chip even with extra space for power distribution and vias. Given the
abundance of on-chip wiring resources, interconnect power dissipation will
likely be a more serious constraint than bisection bandwidth for most
network topologies.

We assume a relatively simple router microarchitecture which includes input
queues, round-robin arbitration, a distributed tristate crossbar, and output
buffers. The routers in our multihop networks have similar radices, so we
fix the router latency to be two cycles. For a 5x5 router with 128 b flits
of uniformly random data, we estimate the energy to be 16 pJ/flit. Notice
that sending a 128 b flit across a 2.5 mm channel consumes roughly 13 pJ,
which is comparable to the energy required to move this flit through a
simple router. Future on-chip network designs must therefore carefully
consider both channel and router energy, and to a lesser extent area.
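
The channel-versus-router comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch rather than the model used in the paper: it assumes that uniformly random data toggles half of the wires per flit, and that the fixed leakage and clocking cost is paid once per wire per cycle. The function and parameter names are hypothetical.

```python
def channel_energy_pj(flit_bits=128, transition_fj=160.0, fixed_fj=20.0,
                      activity=0.5):
    """Estimate the energy (pJ) to send one flit over one 2.5 mm channel.

    Assumptions (not from the source): `activity` is the fraction of wires
    that toggle per flit (0.5 for uniformly random data), and the fixed
    leakage/clocking cost is charged once per wire per cycle.
    """
    dynamic_fj = flit_bits * activity * transition_fj
    static_fj = flit_bits * fixed_fj
    return (dynamic_fj + static_fj) / 1000.0


if __name__ == "__main__":
    router_pj = 16.0  # 5x5 router estimate quoted in the text
    hop_pj = channel_energy_pj()
    print(f"channel hop: {hop_pj:.1f} pJ, router: {router_pj:.1f} pJ")
```

Under these assumptions one 2.5 mm hop costs about 12.8 pJ, in line with the roughly 13 pJ quoted above and close to the 16 pJ router estimate.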

3.2.  Electrical  On-chip  Networks 

Figure 1 illustrates four topologies that we will be discussing in this
section and throughout the paper: global crossbars, two-dimensional meshes,
concentrated meshes, and Clos networks. Table 1 shows some key parameters
for these topologies assuming an MTBw system.

For systems with few tiles, a simple global crossbar is one of the most
efficient network topologies and presents a simple performance model to
software [8]. Such crossbars are strictly non-blocking; as long as an output
is not oversubscribed, every input can send messages to its desired output
without contention. Small crossbars can have very low latency and high
throughput but are difficult to scale to tens or hundreds of tiles.

Figure 1a illustrates a 64x64 crossbar network implemented with distributed
tristate buses. Although such a network provides strictly non-blocking
connectivity, it also requires a large number of global buses across the
length of the chip. These buses are challenging to lay out and must be
pipelined for good throughput. Global arbitration can add significant
latency and also needs to be pipelined. These global control and data wires
result in significant power consumption even for communication between
neighboring tiles. Thus global electrical crossbars are unlikely choices for
future manycore on-chip networks, despite the fact that they might be the
easiest to program.

                    Channels                  Routers               Latency
Topology    N_C   b_C   N_BC  N_BC*b_C    N_R   radix    H     T_R  T_C   T_TC  T_S  T_0
Crossbar    *64   *128  *64   8,192       1     64x64    1     10   n/a   0     4    14
Mesh        224   256   16    4,096       64    5x5      2-15  2    1     0     2    7-46
CMesh       48    512   8     4,096       16    8x8      1-7   2    2     0     1    3-25
Clos        128   128   64    8,192       24    8x8      3     2    2-10  0-1   4    14-32

Table 1: Example MTBw Network Configurations - Networks sized to support
128 b/cycle per tile under uniform random traffic. N_C = number of channels,
b_C = bits/channel, N_BC = number of bisection channels, N_R = number of
routers, H = number of routers along data paths, T_R = router latency,
T_C = channel latency, T_TC = latency from tile to first router,
T_S = serialization latency, T_0 = zero-load latency. * Crossbar "channels"
are the shared crossbar buses.
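
The latency columns of Table 1 can be related to each other with a simple additive model. The sketch below is an assumption, since the text does not give the formula: a packet is taken to cross H routers, H - 1 inter-router channels, and two tile-to-router links, and then to pay the serialization latency once. The function name is hypothetical. With this model the crossbar, mesh, and Clos entries of the T_0 column are reproduced; the CMesh upper bound is not (the model gives 27 cycles rather than 25), so the model should be read as an approximation only.

```python
def zero_load_latency(hops, t_router, t_channel, t_tile, t_serial):
    """Zero-load latency in cycles under an assumed additive model.

    Assumption (not stated in the source): `hops` routers, `hops - 1`
    inter-router channels, two tile-to-router links, one serialization delay.
    """
    return hops * t_router + (hops - 1) * t_channel + 2 * t_tile + t_serial


if __name__ == "__main__":
    # (H, T_R, T_C, T_TC, T_S) from Table 1. The crossbar's "n/a" channel
    # latency is entered as 0 because a one-router path has no
    # inter-router channel.
    rows = {
        "Crossbar": (1, 10, 0, 0, 4),
        "Mesh (min)": (2, 2, 1, 0, 2),
        "Mesh (max)": (15, 2, 1, 0, 2),
        "Clos (min)": (3, 2, 2, 0, 4),
        "Clos (max)": (3, 2, 10, 1, 4),
    }
    for name, params in rows.items():
        print(f"{name}: {zero_load_latency(*params)} cycles")
```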

Two-dimensional mesh networks (Figure 1b) are popular in systems with more
tiles due to their simplicity in terms of design, wire routing, and
decentralized flow control [3, 14]. Unfortunately, high hop counts result in
long latencies and significant energy consumption in both routers and
channels. Because network latency and throughput are critically dependent on
application mapping, low-dimensional mesh networks also impact programmer
productivity by requiring careful optimization of task and data placement.

Moving from low-dimensional to high-dimensional mesh networks (e.g., 4-ary
3-cubes) reduces the network diameter, but requires long channels when
mapped to a planar substrate. Also, higher-radix routers are required,
resulting in more area and higher router energy. Instead of adding network
dimensions, researchers have proposed using concentration to help reduce hop
count [1]. Figure 1c illustrates a two-dimensional mesh with a concentration
factor of four (cmesh). One of the disadvantages of cmesh topologies is
that, for the same theoretical throughput, channels are wider than in an
equivalent mesh topology, as shown in Table 1. One option to improve channel
utilization for shorter messages is to divide resources among multiple
parallel cmesh networks with narrower channels. The cmesh topology should
achieve similar throughput as a standard mesh with half the latency, at the
cost of longer channels and higher-radix routers. CMesh topologies still
require careful application mappings for good performance.

Clos networks offer an interesting intermediate point between the
high-radix, low-diameter crossbar topology and the low-radix, high-diameter
mesh topology [4]. Figure 1d illustrates an 8-ary 3-stage Clos topology
which reduces the hop count but requires longer point-to-point channels.
Figure 2 shows one possible layout of this topology. Clos networks use many
small routers and extensive path diversity. Although the specific Clos
network shown here is reconfigurably non-blocking instead of strictly
non-blocking, we can still minimize congestion with an appropriate routing
algorithm (assuming the outputs are not oversubscribed). Unfortunately, Clos
networks still require global point-to-point channels and, as with a
crossbar, these global channels can be difficult to lay out and have
significant energy cost.

4.  Photonic  On-Chip  Networks 

Silicon photonics is a promising new technology which offers lower power,
higher bandwidth density, and shorter latencies than electrical
interconnects. Photonics is particularly effective for global interconnects
and thus has the potential to enable scalable low-diameter on-chip networks,
which should ease manycore parallel programming. In this section, we first
introduce the underlying photonic technology before discussing the cost of
implementing some of the global photonic crossbars found in the literature.
We then introduce our own approach to implementing a photonic Clos network,
and compare its cost to photonic crossbars.

4.1.  Photonic  Technology 

Figure 3 illustrates the various components in a typical
wavelength-division multiplexed (WDM) photonic link used for on-chip
communication. Light from an off-chip two-wavelength (λ1, λ2) laser source
is carried by an optical fiber and then coupled into an on-chip waveguide.
The waveguide carries the light past a series of transmitters, each using a
resonant ring modulator to imprint the data on the corresponding wavelength.
Modulated light continues through the waveguide to the other side of the
chip, where each of the two receivers uses a tuned resonant ring filter to
"drop" the corresponding wavelength from the waveguide into a local
photodetector. The photodetector turns absorbed light into current, which is
sensed by the electrical receiver. Both 3D and monolithic integration
approaches have been proposed in the past few years to implement
silicon-photonic on-chip networks.

With 3D integration, a separate specialized die or layer is used for
photonic devices. Devices can be implemented in monocrystalline
silicon-on-insulator (SOI) dies with a thick layer of buried oxide (BOX)
[5], or in a separate layer of silicon nitride (SiN) deposited on top of the
metal stack [2]. In this separate die or layer, customized processing steps
can be used to optimize device performance. However, this customized
processing approach increases the number of processing steps and hence the
manufacturing cost. In addition, the circuits required to interface the two
chips can consume significant area and power.

               Modulator and Driver Circuits            Receiver Circuits
Design         DDE        FE         TTE                DDE        FE         TTE                ELP
Aggressive     20 fJ/bt   5 fJ/bt    16 fJ/bt/heater    20 fJ/bt   5 fJ/bt    16 fJ/bt/heater    3.3 W
Conservative   80 fJ/bt   10 fJ/bt   32 fJ/bt/heater    40 fJ/bt   20 fJ/bt   32 fJ/bt/heater    33 W

Table 2: Aggressive and Conservative Energy and Power Projections for
Photonic Devices - fJ/bt = average energy per bit-time, DDE = Data-traffic
dependent energy, FE = Fixed energy (clock, leakage), TTE = Thermal tuning
energy (20 K temperature range), ELP = Electrical laser power budget (30%
laser efficiency).

Figure 3: Photonic Components - Two point-to-point photonic links
implemented with WDM.

With monolithic integration, photonic devices are designed using the
existing process layers of a standard logic process. The photonic devices
can be implemented in polysilicon on top of the shallow-trench isolation in
a standard bulk CMOS process [9] or in monocrystalline silicon with advanced
thin-BOX SOI. Although monolithic integration may require some
post-processing, its manufacturing cost can be lower than 3D integration.
Monolithic integration decreases the area and energy required to interface
electrical and photonic devices, but it requires active area for waveguides
and other photonic devices.

Irrespective of the chosen integration methodology, WDM optical links have
many similar optical loss components (see Table 3). Optical loss affects
system design, as it sets the required optical laser power and
correspondingly the electrical laser power (at a roughly 30% conversion
efficiency). Along the optical critical path, some losses such as coupler
loss, non-linearity, photodetector loss, and filter drop loss are relatively
independent of the network layout, size, and topology. For the scope of this
study, we will focus on the loss components which significantly impact the
overall power budget as a function of the type, radix, and throughput of the
network.
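
The way optical loss sets the laser power can be sketched numerically. The following is a minimal illustration under stated assumptions, not the authors' framework: every wavelength is assumed to need a fixed minimum optical power at its photodetector (the 10 µW default is a hypothetical placeholder, not a figure from the text), the path loss in dB scales that requirement back to the laser, and the electrical power follows from the roughly 30% conversion efficiency mentioned above. All names are hypothetical.

```python
def laser_power_w(loss_db, n_wavelengths, rx_sensitivity_w=10e-6,
                  laser_efficiency=0.30):
    """Return (optical, electrical) laser power in watts.

    Assumptions: `rx_sensitivity_w` is a hypothetical minimum optical power
    per wavelength at the photodetector; every wavelength sees the same
    worst-case path loss `loss_db`.
    """
    optical = n_wavelengths * rx_sensitivity_w * 10 ** (loss_db / 10.0)
    electrical = optical / laser_efficiency
    return optical, electrical


if __name__ == "__main__":
    for loss in (10.0, 13.0, 20.0):
        opt, elec = laser_power_w(loss, n_wavelengths=64)
        print(f"{loss:4.1f} dB: optical {opt * 1e3:.2f} mW, "
              f"electrical {elec * 1e3:.2f} mW")
```

Because the relation is exponential in dB, every additional 3 dB of loss roughly doubles the required laser power, which is why the loss components that grow with network size matter most.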

In addition to optical loss, ring filters and modulators have to be
thermally tuned to maintain their resonance under on-die temperature
variations. Monolithic integration gives the most optimistic ring heating
efficiency of all approaches (due to in-plane heaters and air-undercut),
estimated at 1 µW per ring per K.

Photonic device               Optical Loss (dB)
Optical Fiber (per cm)        0.5e-5
Coupler                       1
Splitter                      0.2
Non-linearity (at 30 mW)      1
Modulator Insertion           0-1
Waveguide (per cm)            0-5
Waveguide crossing            0.05
Filter through                1e-4 - 1e-2
Filter drop                   1.5
Photodetector                 0.1

Table 3: Optical Loss Ranges per Component

Based on our analysis of various photonic technologies and integration
approaches, we make the following assumptions. With double-ring filters and
a 4 THz free-spectral range, up to 128 wavelengths modulated at 10 Gb/s can
be placed on each waveguide (64 in each direction, interleaved to alleviate
filter roll-off requirements and crosstalk). A non-linearity limit of 30 mW
at 1 dB loss is assumed for the waveguides. The waveguides are single mode,
and a pitch of 4 µm minimizes the crosstalk between neighboring waveguides.
The ring diameters are ≈10 µm. The latency of a global photonic link is
assumed to be 3 cycles (1 cycle in flight and 1 cycle each for E/O and O/E
conversion). For monolithic integration we assume a 5 µm separation between
the photonic and electrical devices to maintain signal integrity, while for
3D integration the photonic devices are designed on a separate specialized
layer. Table 2 shows our assumptions for the photonic link energy and
electrical laser power.
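
The wavelength assumptions above translate directly into a per-waveguide bandwidth budget. The sketch below is illustrative only; it assumes the wavelengths are spread uniformly across one free-spectral range, which the text does not state explicitly, and the function name is hypothetical.

```python
def waveguide_budget(fsr_hz=4e12, wavelengths=128, rate_bps=10e9):
    """Return (wavelength spacing in Hz, bandwidth per direction in b/s).

    Assumption: wavelengths are uniformly spaced across one free-spectral
    range and split evenly between the two propagation directions.
    """
    spacing_hz = fsr_hz / wavelengths
    per_direction_bps = (wavelengths // 2) * rate_bps
    return spacing_hz, per_direction_bps


if __name__ == "__main__":
    spacing, bandwidth = waveguide_budget()
    print(f"spacing: {spacing / 1e9:.2f} GHz")
    print(f"bandwidth per direction: {bandwidth / 1e9:.0f} Gb/s")
```

With the stated numbers this gives a spacing of 31.25 GHz and 640 Gb/s per direction, which matches a 128 b/cycle channel at 5 GHz.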

4.2.  Photonic  Global  Crossbar  Networks 

A global crossbar provides non-blocking all-to-all communication between its
inputs and outputs in a single stage. Figure 4 shows two approaches for
implementing a 4x4 photonic crossbar. Both schemes have multiple
single-wavelength photonic channels carried on a single waveguide using WDM.
Crossbars with higher radix and/or greater channel bandwidths will require
more wavelengths and more waveguides. Both examples require global
arbitration to determine which input can send to which output. Various
arbitration schemes are possible, including electrical and photonic versions
of centralized and distributed arbitration.

Figure 4: Photonic 4x4 Crossbars - Both crossbars have four inputs (I1-4),
four outputs (O1-4), and four channels which are wavelength-division
multiplexed onto the U-shaped waveguide. Number next to each ring indicates
resonant wavelength. (a) distributed mux crossbar (DMXbar) with one channel
per output, (b) centralized mux crossbar (CMXbar) with one channel per
input.

Figure 4a illustrates a distributed mux crossbar (DMXbar) where there is one
channel per output and every input can modulate every output channel. As an
example, if I1 wants to send a message to O3 it first arbitrates and then
modulates wavelength λ3. This light will experience four modulator insertion
losses, 13 through losses, and one drop loss. Notice that although a DMXbar
only needs one ring filter per output, it requires O(nr^2) modulators, where
r is the crossbar radix and n is the number of wavelengths per port. For
larger radix crossbars with wider channel bitwidths the number of modulators
can significantly impact optical power, thermal tuning power, and area. For
large distributed-mux crossbars this requires very aggressive photonic
modulator device design. Vantrease et al. have proposed a global 64x64
photonic crossbar which is similar in spirit to the DMXbar scheme and
requires about a million rings [15]. Their work uses a photonic token
passing network to implement the required global arbitration.

Figure 4b illustrates an alternative approach called a centralized mux
crossbar (CMXbar) where there is one channel per input and every output can
listen to every input channel. As an example, if I3 wants to send a message
to O1 it first arbitrates and then modulates wavelength λ3. By default all
ring filters at the receivers are slightly off-resonance, so output O1
receives the message by tuning in the ring filter for λ3. This light will
experience one modulator insertion loss, 13 through losses, three detuned
receiver through losses, and one drop loss. If all ring filters were always
tuned in, then wavelength λ3 would have to be split among all the outputs
even though only one output is ever going to actually receive the data.
Although useful for broadcast, this would drastically increase the optical
power. A CMXbar only needs one modulator per input (and so is less sensitive
to modulator insertion loss), but it requires O(nr^2) drop filters. As with
the DMXbar, this can impact optical power, thermal tuning power, and area,
and it necessitates aggressive reduction in the ring through loss.
Additionally, tuning of the appropriate drop filter rings when receiving a
message is done using charge injection, and this incurs a fixed overhead
cost of 50 µW per tuned ring. Kirman et al. investigated a global bus-based
architecture which is similar to the CMXbar scheme [7]. Nodes optically
broadcast a request signal to all other nodes, and then a distributed
arbitration scheme allows all nodes to agree on which receiver rings to tune
in. Psota et al. have also proposed a CMXbar-like scheme which focuses on
supporting global broadcast where all receivers are always tuned in [12].

Figure 5: Serpentine Layout for 64x64 CMXbar - Electrical circuitry shown in
red. 64 waveguides (8 sets of 8) are either routed between columns of tiles
(monolithic integration) or over tiles (3D integration). One 128 b/cycle
channel is mapped to each waveguide, with 64 λ going from left to right and
64 λ going from right to left. Each tile modulates a unique channel and
every tile can receive from any channel.
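
The per-path loss counts quoted for the two 4x4 crossbar examples can be turned into rough dB totals. The sketch below is an illustration under assumptions: the per-component values are picked from the ranges in Table 3 (the upper end for modulator insertion and filter through loss), a detuned receiver ring is assumed to cost one ordinary through loss, and components that do not depend on the crossbar style (coupler, waveguide, photodetector) are left out. The names are hypothetical.

```python
# Per-component losses in dB. These are assumed values picked from the
# ranges in Table 3, not measurements.
LOSS_DB = {
    "modulator_insertion": 1.0,
    "filter_through": 0.01,
    "filter_drop": 1.5,
}


def path_loss_db(counts, loss_db=LOSS_DB):
    """Sum component losses (dB) along one optical path."""
    return sum(loss_db[name] * n for name, n in counts.items())


# Component counts quoted in the text for the 4x4 examples. The three
# detuned receiver rings of the CMXbar are counted as through losses.
DMXBAR_PATH = {"modulator_insertion": 4, "filter_through": 13,
               "filter_drop": 1}
CMXBAR_PATH = {"modulator_insertion": 1, "filter_through": 13 + 3,
               "filter_drop": 1}

if __name__ == "__main__":
    print(f"DMXbar path: {path_loss_db(DMXBAR_PATH):.2f} dB")
    print(f"CMXbar path: {path_loss_db(CMXBAR_PATH):.2f} dB")
```

With these assumed values the DMXbar path is dominated by modulator insertion loss, while the CMXbar path is more exposed to the through loss of its many receiver rings, which is the sensitivity the text describes.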

Although Figure 4 shows two of the more common approaches proposed in the
literature, there are other schemes which use a significantly different
implementation. Zhou et al. describe an approach which replaces the U-shaped
waveguide with a matrix of passive ring filters [17]. This approach still
requires either multiple modulators per input or multiple ring filters per
output, but results in shorter waveguide lengths since all wavelengths do
not need to pass by all tiles. Unfortunately, the matrix also increases the
number of rings and waveguide crossings. Petracca et al. describe a crossbar
implementation which leverages photonic switching elements that switch many
wavelengths with a single ring resonator [10]. Their scheme requires an
electrical control network to configure these photonic switching elements,
and thus is best suited for transmitting very long messages which amortize
configuration overhead. In this paper, we focus on the schemes illustrated
in Figure 4 and leave a detailed comparison to more complicated crossbars
for future work.

The DMXbar and CMXbar schemes can be extended to much larger systems in a
variety of ways. A naive extension of the CMXbar scheme in Figure 4b is to
lay out a global loop around the chip with light always traveling in one
direction. Unfortunately this layout has an optical critical path which
would traverse the loop twice. Figure 5 shows a more efficient serpentine
layout of the CMXbar scheme for our target system of 64 tiles. This crossbar
has 128 b/cycle input ports, which makes it suitable for an MTBw system
(i.e., 128 b/cycle per tile under uniform random traffic). At a 5 GHz clock
rate, each channel uses 64 λ (10 Gb/s/λ), and we need a total of 64
waveguides (1 waveguide/channel). An input can send light in either
direction on the waveguides, which shortens the optical critical path but
requires additional modulators per input.
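
The wavelength and waveguide counts in this paragraph follow from the channel width, clock rate, and per-wavelength data rate. The sketch below reproduces that arithmetic; the helper names are hypothetical, and the waveguide count assumes that each channel occupies whole waveguides carrying up to 64 λ per direction, which is an extrapolation from the MTBw case described in the text.

```python
def wavelengths_per_channel(bits_per_cycle, clock_hz=5_000_000_000,
                            rate_bps=10_000_000_000):
    """Wavelengths needed to carry one channel at the given clock rate."""
    channel_bps = bits_per_cycle * clock_hz
    return -(-channel_bps // rate_bps)  # ceiling division on integers


def waveguides_needed(channels, lambdas_per_channel,
                      lambdas_per_direction=64):
    """Waveguides needed if each channel occupies whole waveguides.

    Assumption: channels do not share a waveguide, as in the MTBw layout.
    """
    per_channel = -(-lambdas_per_channel // lambdas_per_direction)
    return channels * per_channel


if __name__ == "__main__":
    for name, width in (("LTBw", 64), ("MTBw", 128), ("HTBw", 256)):
        print(f"{name}: {wavelengths_per_channel(width)} wavelengths/channel")
    mtbw = wavelengths_per_channel(128)
    print(f"MTBw waveguides: {waveguides_needed(64, mtbw)}")
```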

The total power dissipated in the on-chip photonic network can be divided
into two components. The first component consists of power dissipated in the
photonic components, i.e., power at the laser source and the power
dissipated in thermal tuning. The second part consists of electrical power
dissipated in the modulator driver, receiver, and arbitration circuits. Here
we quantify the first power component and then in Section 5 we provide a
detailed analysis of the second power component.

The optical losses experienced in the various optical components
and the desired network capacity determine the total optical
power needed at the laser source. In the serpentine layout of a
CMXbar, the waveguide and ring through loss are the dominant
loss components, due to the long waveguides (9.5 cm) and large
number of rings (128 modulator rings and 63 × 64 = 4032 filter
rings) along each waveguide. Figure 6 shows two contour plots
of the optical power required at the laser source for the LTBw
and HTBw systems with a photonic CMXbar network. For a given
value of waveguide loss and through loss per ring, the number of
wavelengths per waveguide is the same for the two systems.
However, the higher bandwidth system requires wider global buses
which increases the optical power required at the laser source.
As a result, the LTBw system can tolerate higher losses per
component compared to the HTBw system for the same optical


[Figure 6 contour plots: panels (a) LTBw and (b) HTBw; axis label: Waveguide loss (dB/cm)]

Figure 6: Laser Optical Power (W) (top row) and Percent
Area (bottom row) for 64×64 CMXbar - Systems implemented
with serpentine layout on 20×20 mm die.


            Global Crossbar         Clos
System      Rings      Power        Rings    Power
LTBw        266 k      5.3 W        14 k     0.28 W
HTBw        1,000 k    21.3 W       57 k     1.14 W

Table 4: Thermal Power - Power required to thermally tune
the rings in the network over a temperature range of 20 K.

power  budget. 
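
To make the dependence on the two dominant loss terms concrete, here is a first-order sketch of the loss along one serpentine CMXbar waveguide. It is an illustration under stated assumptions, not the model used to produce Figure 6: it keeps only waveguide loss and ring through loss, assumes the worst-case wavelength passes every ring on the waveguide, and sweeps hypothetical device parameters. The length and ring counts are the ones quoted in the text.

```python
WAVEGUIDE_LENGTH_CM = 9.5
RINGS_PER_WAVEGUIDE = 128 + 63 * 64   # modulator rings + filter rings


def dominant_loss_db(waveguide_loss_db_per_cm, through_loss_db_per_ring):
    """Loss from the two dominant terms only (couplers, filters, etc. ignored)."""
    return (waveguide_loss_db_per_cm * WAVEGUIDE_LENGTH_CM
            + through_loss_db_per_ring * RINGS_PER_WAVEGUIDE)


def laser_power_multiplier(loss_db):
    """Factor by which source power must exceed the power needed at the receiver."""
    return 10 ** (loss_db / 10)


# Hypothetical device parameters, chosen only to show the trend.
for wg_loss in (0.5, 1.0, 2.0):
    for ring_loss in (0.0001, 0.001, 0.01):
        loss = dominant_loss_db(wg_loss, ring_loss)
        print(f"{wg_loss:4.1f} dB/cm, {ring_loss:.4f} dB/ring -> "
              f"{loss:6.2f} dB, x{laser_power_multiplier(loss):.3g}")
```

Because the loss in dB enters the laser power exponentially, small changes in per-ring through loss are amplified by the several thousand rings on each waveguide.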

Figure 6 shows contour plots of the percent area required for
the optical devices for the LTBw and HTBw systems. The
non-linearity limit affects the number of wavelengths that can
be routed on each waveguide and hence the number of required
waveguides, making photonic device area dependent on optical
loss. As expected, the HTBw system requires increased photonic
area for each loss combination. There is a lower limit on the
area overhead which occurs when all of the wavelengths per
waveguide are utilized. The minimum area for the LTBw and HTBw
systems is 6% and 23%, respectively.

To calculate the required power for thermal tuning, we assume
that under typical conditions the rings in the system would
experience a temperature range of 20 K. Table 4 shows the power
required for thermal tuning in the crossbar. Although each
modulator and ring filter uses two cascaded rings, we assume
that these two rings can share the same heater. The large number
of rings in the crossbar significantly increases both thermal
tuning and area overheads.
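
As a quick sanity check on Table 4, the sketch below divides each power entry by its ring count to get the implied tuning power per ring. This per-ring figure is derived here from the table and is not stated in the text; the ring counts in the table are rounded, so the results are approximate.

```python
# (rings, thermal tuning power in W) copied from Table 4.
TABLE_4 = {
    ("Global Crossbar", "LTBw"): (266_000, 5.3),
    ("Global Crossbar", "HTBw"): (1_000_000, 21.3),
    ("Clos", "LTBw"): (14_000, 0.28),
    ("Clos", "HTBw"): (57_000, 1.14),
}
TEMPERATURE_RANGE_K = 20


def per_ring_microwatts(rings, watts):
    """Implied thermal tuning power per ring, in microwatts."""
    return watts / rings * 1e6


for (topology, system), (rings, watts) in TABLE_4.items():
    uw = per_ring_microwatts(rings, watts)
    print(f"{topology:15s} {system}: {uw:5.1f} uW/ring, "
          f"{uw / TEMPERATURE_RANGE_K:4.2f} uW/ring/K")

# Ratio of crossbar to Clos tuning power for the LTBw design point.
crossbar_vs_clos = TABLE_4[("Global Crossbar", "LTBw")][1] / TABLE_4[("Clos", "LTBw")][1]
print(f"crossbar / Clos tuning power (LTBw): {crossbar_vs_clos:.1f}x")
```

The near-constant per-ring value suggests that the table's power column scales directly with ring count, so the gap between the crossbar and Clos columns is essentially the gap in ring count.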

We can use a serpentine layout similar to the one shown in
Figure 5 to implement a DMXbar. There would be one output tile
per waveguide and there would be no need to tune or detune the
drop filters. We would, however, require a large number of
modulators per waveguide (63 × 64 = 4032) and modulator insertion
loss would most likely dominate the optical power loss. For this
topology to be feasible, novel modulators with close to 0 dB
insertion loss need to be designed. The area for photonic devices
and power dissipated in thermally tuning the rings would be
similar to that in the CMXbar implementation.

[Figure 7 panels: (a) Clos with Photonic Point-to-Point Channels; (b) Clos with Photonic Middle Routers]

Figure 7: Photonic 2-ary 3-stage Clos Networks - Both
networks have four inputs (I1-4), four outputs (O1-4), and
six 2×2 routers (R0-2,0-1). (a) Four point-to-point photonic
channels use WDM on each U-shaped waveguide, (b) the two
middle routers (R1,0-1) are implemented with photonic 2×2
CMXbars on a single U-shaped waveguide. Number next to
each ring indicates resonant wavelength.

The large number of rings required for photonic crossbar
implementations makes monolithic integration impractical from
an area perspective, and 3D integration is expensive due to the
power cost of thermal tuning (even in the case when all the
circuits of the inactive transmitters/receivers can be fully
powered down). The actual cost of these crossbar networks will
be even higher than indicated in this section since we have not
accounted for arbitration overhead. These observations motivate
our interest in photonic Clos networks which preserve much of
the simplicity of the crossbar programming model, while
significantly reducing area and power.

4.3.  Photonic  Clos  Networks 

As described in Section 3.2, a Clos network uses multiple
stages of small routers to create a larger non-blocking
all-to-all network. Figure 7 shows two approaches for
implementing a 2-ary 3-stage Clos network. In Figure 7a, all
of the Clos routers are implemented electrically and the
inter-router channels are implemented with photonics. As
an example, if input I2 wants to communicate with output
O4 then it can use either middle router. If the routing
algorithm chooses R1,1, then the network will use wavelength
λ2 on the first waveguide to send the message to
R1,1 and wavelength λ4 on the second waveguide to send
the message to O4. Figure 7b is logically the same topology,
but each middle router is implemented with a photonic
CMXbar. The channels for both crossbars are multiplexed
onto the same waveguide using WDM. Note that we still
use electrical buffering and arbitration for these photonic
middle routers. Using photonic instead of electrical middle
routers removes one stage of EOE conversion and can
potentially lower the dynamic power of the middle router
crossbars, but at the cost of higher optical and thermal
tuning power. Depending on photonic device losses, this
tradeoff may be beneficial since for our target system the
radix of the Clos routers (8×8) is relatively low. In this
paper, we focus on the Clos with photonic point-to-point
channels since it should have the lowest optical power,
thermal tuning power, and area overhead.

Figure 8: U-Shaped Layout for 8-ary 3-stage Clos - Electrical
circuitry shown in red. 56 waveguides (8 sets of 7) are
either routed between columns of tiles (monolithic integration)
or over tiles (3D integration). Each of the 8 clusters
(8 tiles per cluster) has electrical channels to its router group
which contains one router per Clos stage. In the inset, the
first set of 7 waveguides are used for channels (each 64 λ =
128 b/cycle) connecting to and from every other cluster. The
second set of 7 waveguides are used for the second half of
the Clos network. The remaining 42 waveguides are used for
point-to-point channels between other clusters.

As in the crossbar case, there are multiple ways to extend
this smaller Clos network to larger systems. For a
fair comparison, we keep the same packaging constraints
(i.e., location of vertical couplers) and also try to use
the light from the laser most efficiently. Figure 8 shows
the U-shaped layout of the photonic Clos network in a
MTBw system, which corresponds to 64 λ per channel.
Each point-to-point photonic channel uses either forward
or backward propagating wavelengths depending on the
physical location of the source and destination clusters.

[Figure 9 contour plots: panels (a) LTBw and (b) HTBw; axis label: Waveguide loss (dB/cm)]

Figure 9: Laser Optical Power (W) (top row) and Percent
Area (bottom row) for 8-ary 3-stage Clos - Systems implemented
with U-shaped layout on 20×20 mm die.

In a Clos network, the waveguide and ring through
losses contribute significantly to the total optical loss but
to a lesser extent than in a crossbar network, due to shorter
waveguides and fewer rings along each waveguide. All the
waveguides in the Clos network are roughly 2× shorter
and have 20× fewer rings along each waveguide compared
to a crossbar network. Figure 9 shows the optical power
contours for the Clos network.

Although the number of optical channels in the Clos
network is higher than in the crossbar network, the total
number of rings (for the same bandwidth) is significantly
smaller since optical channels are point-to-point, resulting
in significantly smaller tuning (Table 4) and area costs.
The area overhead shown in Figure 9 is much smaller than
for a crossbar due to shorter waveguides and a smaller number
of rings, and is well suited for monolithic integration
with a wider range of device losses. The lower limit on
the area overhead is 2% and 8% for LTBw and HTBw,
respectively.

Based on this design-space exploration we propose using
the photonic Clos network for on-chip communication.
Clos networks have lower area and thermal tuning
costs and higher tolerance of photonic device losses as
compared to global photonic crossbars. In the next section
we compare this photonic Clos network with electrical
implementations of mesh, cmesh, and Clos networks
in terms of throughput, latency, and power.

5. Simulation Results

In this section, we use a detailed cycle-accurate
microarchitectural simulator to study the performance and
power of various electrical and photonic networks for
a 64-tile system with 512 b messages. Our model includes
pipeline latencies, router contention, flow control,
and serialization overheads. Warm-up, measure, and
drain phases of several thousand cycles and infinite source
queues were used to accurately determine the latency at
a given injection rate. Various events (e.g., channel
utilization, queue accesses, arbitration) were counted during
simulation and then multiplied by energy values derived
from first-order gate-level models.
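
The event-counting approach to power estimation can be sketched as follows. This is a minimal illustration of the bookkeeping only: the event names, per-event energies, and counts are hypothetical placeholders and are not the values derived from the gate-level models used in this work.

```python
# Hypothetical per-event energies in picojoules (placeholders, not paper values).
ENERGY_PJ = {"channel_traversal": 2.0, "queue_access": 1.0, "arbitration": 0.2}


def average_power_watts(event_counts, energy_pj, cycles, clock_hz):
    """Average power = (sum of count x energy per event) / measured time."""
    total_joules = sum(event_counts[name] * energy_pj[name]
                       for name in event_counts) * 1e-12
    measured_seconds = cycles / clock_hz
    return total_joules / measured_seconds


# Hypothetical counts gathered during the measurement phase of one simulation.
counts = {"channel_traversal": 400_000, "queue_access": 800_000, "arbitration": 400_000}
power = average_power_watts(counts, ENERGY_PJ, cycles=10_000, clock_hz=5e9)
print(f"{power:.2f} W")
```

Only events counted during the measurement phase should be included, so that warm-up and drain activity does not distort the average.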

Our baseline includes three electrical networks: a 2D
mesh (emesh), a mesh with a concentration factor of four
(ecmeshx2), and an 8-ary 3-stage Clos (eclos). Because
a single concentrated mesh would have channel bit widths
larger than our message size for some configurations, we
implement two parallel cmeshes with narrow channels
and randomly interleave messages between them. We also
study a photonic implementation of the Clos network (pclos)
with aggressive (pclos-a) and conservative (pclos-c)
photonic devices (see Table 2). We show results for LTBw
and HTBw systems which correspond to ideal throughputs
of 64 b/cycle and 256 b/cycle per tile for uniform random
traffic. Our mesh networks use dimension-ordered
routing, while our Clos networks use a randomized oblivious
routing algorithm (i.e., randomly choosing the middle
router). All networks use wormhole flow control.
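
The two routing policies can be sketched in a few lines. This is an illustration rather than the simulator's code: row-major tile numbering on an 8×8 mesh, X-before-Y ordering, and assigning tile t to cluster t // 8 are all assumptions made here for concreteness.

```python
import random

TILES_PER_CLUSTER = 8    # 8-ary Clos: eight tiles share each input/output router
NUM_MIDDLE_ROUTERS = 8
MESH_WIDTH = 8           # 64 tiles as an 8x8 mesh, row-major numbering (assumed)


def clos_route(src_tile, dst_tile, rng=random):
    """Return (input router, middle router, output router) for one message."""
    input_router = src_tile // TILES_PER_CLUSTER
    # Oblivious: the middle router is picked without looking at load or destination.
    middle_router = rng.randrange(NUM_MIDDLE_ROUTERS)
    output_router = dst_tile // TILES_PER_CLUSTER
    return input_router, middle_router, output_router


def mesh_route_xy(src_tile, dst_tile):
    """Dimension-ordered routing: finish all X hops, then all Y hops."""
    x, y = src_tile % MESH_WIDTH, src_tile // MESH_WIDTH
    dst_x, dst_y = dst_tile % MESH_WIDTH, dst_tile // MESH_WIDTH
    path = [src_tile]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * MESH_WIDTH + x)
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * MESH_WIDTH + x)
    return path


print(clos_route(10, 60, random.Random(0)))
print(mesh_route_xy(0, 63))
```

Every Clos path has the same three router hops regardless of source and destination, while the mesh hop count grows with distance, which is the root of the latency differences discussed in this section.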

We use synthetic traffic patterns based on a partitioned
application model. Each traffic pattern has some number
of logical partitions, and tiles randomly communicate
only with other tiles that are in the same partition. These
logical partitions are then mapped to physical tiles in either
a co-located fashion (tiles within a partition are physically
grouped together) or in a distributed fashion (tiles
in a partition are distributed across the chip). We believe
these partitioned traffic patterns capture the varying locality
present in manycore programs. Although we studied
various partition sizes and mappings, we focus on the
following four representative patterns in this paper. A single
global partition is identical to the standard uniform random
traffic pattern (UR). The P8C pattern has eight partitions
each with eight tiles optimally co-located together.
The P8D pattern stripes these partitions across the chip.
The P2D pattern has 32 partitions each with two tiles, and
these two tiles are mapped to diagonally opposite quadrants
of the chip.
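
One possible way to generate these partitioned patterns is sketched below. The exact tile mappings are not given in the text, so the choices here are assumptions: P8C uses 4-wide by 2-tall blocks, P8D assigns every eighth tile to the same partition, P2D pairs each tile with the tile offset by half the mesh in both dimensions, and a tile never sends to itself.

```python
import random

MESH_WIDTH = 8
NUM_TILES = MESH_WIDTH * MESH_WIDTH


def p8c_partition(tile):
    """Eight co-located partitions: 4-wide by 2-tall blocks of tiles (assumed shape)."""
    x, y = tile % MESH_WIDTH, tile // MESH_WIDTH
    return (y // 2) * 2 + (x // 4)


def p8d_partition(tile):
    """Eight partitions striped across the chip: every eighth tile (assumed mapping)."""
    return tile % 8


def p2d_partner(tile):
    """Partner tile in the diagonally opposite quadrant (assumed mapping)."""
    x, y = tile % MESH_WIDTH, tile // MESH_WIDTH
    half = MESH_WIDTH // 2
    return ((y + half) % MESH_WIDTH) * MESH_WIDTH + (x + half) % MESH_WIDTH


def members(partition_of, partition):
    """All tiles that belong to the given partition."""
    return [t for t in range(NUM_TILES) if partition_of(t) == partition]


def random_destination(src, partition_of, rng=random):
    """Pick a destination uniformly among the other tiles in src's partition."""
    peers = [t for t in members(partition_of, partition_of(src)) if t != src]
    return rng.choice(peers)


print(members(p8c_partition, 0))
print(members(p8d_partition, 0))
print(p2d_partner(0))
```

With a single partition containing all 64 tiles, `random_destination` reduces to the uniform random (UR) pattern.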

Figure 10 shows the latency versus offered bandwidth
for the LTBw and HTBw systems with different traffic
patterns. In both emesh and ecmeshx2, the P8C traffic
pattern requires only local communication and thus
has higher performance. The P2D traffic pattern requires
global communication which results in lower performance.
On average, ecmeshx2 saturates at higher
bandwidths than emesh due to the path diversity provided
by the two cmesh networks, and has lower latency
due to lower average hop count. Although not
shown in Figure 10, the eclos network has similar saturation
throughput to pclos but with higher average latency.
Because pclos always distributes traffic randomly across
its middle routers, it has uniform latency and throughput
across all traffic patterns. Note, however, that pclos
performs better than emesh and ecmeshx2 on global
traffic patterns (e.g., P2D) and worse on local traffic
patterns (e.g., P8C). If the pclos power consumption is low
enough for the LTBw system then we should be able to
increase the size to a MTBw or HTBw system. A larger pclos
network will hopefully have similar performance and
energy-efficiency for local traffic patterns as compared to
emesh and ecmeshx2 and much better performance and
energy-efficiency for global traffic patterns.

Figure 10: Latency vs. Offered Bandwidth - LTBw systems
have a theoretical throughput of 64 b/cycle per tile for
UR; the corresponding throughput for HTBw is 256 b/cycle.

Figure 11 shows the power dissipation versus offered
bandwidth for various network topologies with the P8C
and P8D traffic patterns. In order to match the performance
of the ecmeshx2 LTBw system we need to use the
pclos-a MTBw system, which has slightly higher power
for the P8C traffic pattern (local communication) and
much lower power for the P8D traffic pattern (global
communication), assuming we are at medium to high load.
Laser power is not included in Figure 11, which may be
appropriate for systems primarily limited by the power
density of the processor chip, but may not be appropriate
for energy-constrained systems or for systems limited by
the total power consumption of the motherboard.

Figure 11: Power Dissipation vs. Offered Bandwidth -
3.3 W laser power not included for the pclos-a topology.

Figure 12 shows the power breakdowns for various
topologies and traffic patterns, for both LTBw and
HTBw design points that can support the desired offered
bandwidth with lowest power. Compared to emesh
and ecmeshx2, the pclos-a network provides comparable
performance and low power dissipation for global
traffic patterns, and comparable performance and power
dissipation for local traffic patterns. The energy-efficiency
of the pclos-a network increases when it is sized for higher
throughputs (higher utilization) due to the static laser power
component. More importantly, the pclos-a network offers
a global low-dimensional network with uniform performance
which should simplify manycore parallel programming.
The energy efficiency of the pclos network might be
further improved by investigating alternative implementations
which use photonic middle routers as shown
in Figure 7b.

It is important to note that with conservative optical
technology projections, even in a relatively simple optical
network like pclos, the required electrical laser power is
much larger than other components, and the photonic network
will usually consume higher power than the electrical
networks. This strong coupling between overall
network performance, topology and underlying photonic
components underlines the need for the fully integrated
vertical design approach illustrated in this paper.

Figure 12: Dynamic Power Breakdown - Power of eclos
and pclos did not vary significantly across traffic patterns.
(a) LTBw systems at 2 kb/cycle offered bandwidth (except
for emesh/p2d and ecmeshx2/p2d which saturated before
2 kb/cycle, HTBw system shown instead), (b) HTBw systems
at 8 kb/cycle offered bandwidth (except for emesh/p2d and
ecmeshx2/p2d which saturated before 8 kb/cycle).


6.  Conclusion 

We have proposed and evaluated a silicon-photonic
Clos network for global on-chip communication. Since
the Clos network uses point-to-point channels instead of
the global shared channels found in crossbar networks,
our photonic Clos implementations consume significantly
less optical power, thermal tuning power, and area overhead,
while imposing less aggressive loss requirements
on photonic devices. Our simulations show that the resulting
photonic Clos networks should provide higher
energy-efficiency than electrical implementations of mesh and
Clos networks with equivalent throughput. A unique feature
of a photonic Clos network is that it provides uniformly
low latency and uniformly high bandwidth regardless
of traffic pattern, which helps reduce the programming
challenge introduced by highly parallel systems.

ABSTRACT

Future single-board multi-socket systems may be unable to
deliver the needed memory bandwidth electrically due to
power limitations, which will hurt their ability to drive
performance improvements. Energy efficient off-chip silicon
photonics could be used to deliver the needed bandwidth,
and it could be extended on-chip to create a relatively flat
network topology. That flat network may make it possible
to implement the same number of cores with a greater
number of smaller dies for a cost advantage with negligible
performance degradation.

Categories and Subject Descriptors: B.4.3 [Computer
Systems Organization]: Processor Architectures [Parallel
Architectures]

General Terms: Design, Economics, Performance
Keywords: Silicon Photonics, Multi-socket

1.  INTRODUCTION 

Given the difficulties of scaling uniprocessor performance
further, most commercial microprocessor manufacturers have
instead used increased transistor densities to integrate
multiple processor cores on one die [1]. To deliver further
performance improvements, multi-socket systems have been used
to increase the computing power and memory capacity. These
multi-socket systems will require increasing memory bandwidth
to deliver realizable improvements in application performance.
This bandwidth must come not only from connections
to DRAM, but also from inter-socket links. Even
if the bandwidth to these systems is not hampered by pin
limitations, it will be restricted by power limitations from
electrical off-chip signalling.

Silicon photonics could be used off-chip to solve this
bandwidth problem, with its great potential for energy efficiency
and bandwidth density. If photonics is used for the
inter-socket links, it could also be extended on-chip closer to its
destinations. In this work we present a scalable interconnect
based on monolithically integrated silicon photonics that is
able to harness the technology’s potential to create a
uniform network topology. With an approximately flat
multi-socket interconnect, the penalty for communicating between
sockets is reduced, which may enable potential cost benefits
from implementing the same aggregate die area over a
greater number of smaller dies.

Copyright is held by the author/owner(s).

ICS’09,  June  8-12,  2009,  Yorktown  Heights,  New  York,  USA. 

ACM  978-1-60558-498-0/09/06. 


Christopher  Batten,  Ajay  Joshi, 

Vladimir  Stojanovic 
Department  of  Electrical  Engineering  and 
Computer  Science 

Massachusetts  Institute  of  Technology, 

Cambridge, Massachusetts

{cbatten, joshi, vlada}@mit.edu

2.  SILICON  PHOTONICS  POTENTIAL 

Silicon photonics has emerged in recent years as an
appealing way to enable high bandwidths without excessive
area or power requirements [3, 5, 6]. Due to the diversity in
prospective photonic technologies, we selected a particular
proposal for monolithically integrated silicon photonics [2,
4] on which to base our design and its evaluation.

Since photonics uses light rather than electricity to transmit
data, transmitted bits must undergo conversion at both
ends (electro-optical and opto-electrical), which adds a
constant latency and energy penalty. Because those penalties
are constant, photonics excels over longer distances, where
they are better amortized. The most compelling advantages of
silicon photonics over forecasted electrical interconnects are
its high bandwidth density and energy efficiency for off-chip
communication. On-chip, the selected technology performs well
under the same metrics, but only if the signal travels a
non-trivial distance. Using a coupler, it is possible to guide
light from off-chip fibers onto on-chip waveguides without
retransmission or modification. This enables seamless inter-chip
links: if the constant conversion overhead is going to be paid
for off-chip links anyway, it makes sense to traverse the
remaining on-chip portion optically as well, since doing so is
nearly free [2].

For the selected technology, laser light is generated in bulk
off-chip and carried by fiber to splitters on-chip where it
will be directed to the various links. Power is consumed
by photonic links at the endpoints on-chip for signaling as
well as at the off-chip light source. Along the path from the
transmitter to the receiver, there are various types of losses
the signal incurs, and sufficient laser power must be applied
to compensate. The optical critical path is the path with
the most loss from the light source to the last receiver, and
it dictates how much laser power the system will need.
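
The role of the optical critical path can be illustrated with a short sketch. The path names, per-component losses, and receiver sensitivity below are hypothetical, and the sketch makes the simplifying assumption that laser power is sized for the worst-case path.

```python
def critical_path_loss_db(paths):
    """paths maps a path name to the list of component losses (dB) along it."""
    return max(sum(losses) for losses in paths.values())


def required_laser_mw(receiver_sensitivity_mw, loss_db):
    """Source power needed so the receiver still sees its minimum optical power."""
    return receiver_sensitivity_mw * 10 ** (loss_db / 10)


# Hypothetical loss breakdowns (coupler, waveguide, fiber, ... in dB).
paths = {
    "nearest memory controller": [1.0, 0.5, 1.0],
    "farthest memory controller": [1.0, 3.0, 2.0, 1.0, 3.0],
}
worst_loss_db = critical_path_loss_db(paths)
laser_mw = required_laser_mw(0.01, worst_loss_db)
print(worst_loss_db, laser_mw)
```

Because the worst path sets the budget, a layout change that shortens only the lossiest path can reduce laser power even if every other path is unchanged.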

3.  ARCHITECTURE 

For this research we target single-board multi-socket systems,
and our design leverages the potential of photonics to
produce a flat network. These boards could be connected together
by another network to create an even larger system,
but within a board a core sees uniform memory performance.
Since electrical interconnects are advantageous over short
distances, we electrically join groups of cores (4-16) into
clusters by shared L2 caches. These clusters are connected
by dedicated photonic links to every memory controller
(Figure 1). Fully connected networks are often avoided because
of their quadratic growth in resource consumption, but the
energy efficiency and the bandwidth density advantages of
silicon photonics make it tolerable for a small system. A
memory controller may actually communicate with multiple
DRAM channels, but from the point of view of the network,
it is simply a point of arbitration for access to that memory.
The links between the DRAM modules and the memory
controllers are electrical because of the challenges involved
with changing the DRAM interface, but future work could
benefit greatly if these connections were photonic.

Figure 1: Topology for two clusters of four cores
with four memory controllers

Figure 2: Logical view of a two-die system (Die A and Die B)

The simple network topology was chosen not only to make
a flat network, but also to enable a single die design to be
used in varying quantities to make a scalable range of systems.
In this glueless system, a cluster’s memory bandwidth
is uniformly spread across all the memory controllers, so in
the maximum supported system size there is one direct channel
between each cluster and each memory controller. For
systems with fewer populated sockets, each cluster will get the
same total bandwidth, but it will have multiple channels to
each memory controller. To enable this flexibility, the actual
connections between memory controllers and clusters
are made off-chip (Figure 2) so the changes necessary for
systems of different sizes are localized to small off-chip
components. To simplify the packaging and assembly of all the
point-to-point connections, off-chip fibers are grouped into
ribbons, which connect to a star coupler. The star coupler
is a passive device that connects two groups of ribbons such
that each ribbon has at least one fiber directly coupled with
a fiber from every ribbon in the other group. Our design
template is general enough that it is able to scale down to
smaller dies while maintaining the same topology and nearly
identical performance.
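
The way a fixed die design adapts to different system sizes can be sketched as a channel assignment. This is an illustration under assumptions not stated in the text: each cluster has a fixed number of channels equal to the memory controller count of the largest supported system, the populated count divides that number evenly, and channels are assigned round-robin.

```python
def channels_per_pair(max_memory_controllers, populated_memory_controllers):
    """Channels between each cluster and each populated memory controller."""
    if max_memory_controllers % populated_memory_controllers != 0:
        raise ValueError("sketch assumes the populated count divides the maximum evenly")
    return max_memory_controllers // populated_memory_controllers


def connection_map(num_clusters, populated_memory_controllers, max_memory_controllers):
    """Map (cluster, channel index) to a memory controller, round-robin (assumed)."""
    return {
        (cluster, channel): channel % populated_memory_controllers
        for cluster in range(num_clusters)
        for channel in range(max_memory_controllers)
    }


# Hypothetical die with 16 channels per cluster, used in a quarter-populated system.
wiring = connection_map(num_clusters=2, populated_memory_controllers=4,
                        max_memory_controllers=16)
to_mc0 = sum(1 for (cluster, _), mc in wiring.items() if cluster == 0 and mc == 0)
print(channels_per_pair(16, 16), channels_per_pair(16, 4), to_mc0)
```

In this sketch the per-cluster channel count never changes, which mirrors the point that only the off-chip connections differ between system sizes.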

4.  INCENTIVES  FOR  DISINTEGRATION 

Using a greater number of smaller dies to implement the
same silicon area could have cost advantages. Smaller dies
should benefit from higher yield rates and increased tolerance
to process variation, since they could be binned on
finer granularities. A single reusable design will also have a
higher sales volume, which will reduce non-recurring engineering
(NRE) costs. This disintegration is made worthwhile
by photonics, because otherwise it will increase the number
of electrical pins and power spent on the interconnect. For
our design, smaller dies will allow the system to be more
spread out, which will reduce the power density and make
it easier to electrically attach DRAM. Fixed costs per die
(testing, packaging, and assembly) will cause penalties for
using dies that are too small, but the optimum die size for
cost may be smaller than current commercial designs.
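
The tradeoff between yield and fixed per-die costs can be illustrated with a toy cost model. This sketch is not from the original work: it swaps in a simple Poisson yield model, and the total area, defect density, silicon cost, and fixed cost are hypothetical values chosen only to show that an interior optimum can exist.

```python
import math


def system_cost(total_area_mm2, num_dies, defects_per_mm2,
                cost_per_mm2, fixed_cost_per_die):
    """Cost of implementing a fixed silicon area with num_dies equal dies."""
    die_area = total_area_mm2 / num_dies
    # Poisson yield model: the fraction of good dies falls with die area.
    yield_fraction = math.exp(-die_area * defects_per_mm2)
    silicon_cost_per_good_die = die_area * cost_per_mm2 / yield_fraction
    return num_dies * (silicon_cost_per_good_die + fixed_cost_per_die)


# Hypothetical parameters: 800 mm^2 total, 0.002 defects/mm^2,
# 0.1 cost units per mm^2, 5 cost units of fixed cost per die.
costs = {n: system_cost(800, n, 0.002, 0.1, 5.0) for n in (1, 2, 4, 8, 16, 32)}
best = min(costs, key=costs.get)
for n, cost in costs.items():
    print(f"{n:2d} dies: {cost:7.1f}")
print("lowest-cost die count:", best)
```

With these placeholder numbers, one large die is penalized by yield and many tiny dies are penalized by fixed costs, so the minimum falls in between.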

5.  RESULTS 

Using our candidate technology, we evaluated the general
design while varying the die size (16-256 cores/die) and the
maximum supported system size (64-1024 cores). Scaling
to higher core counts will require a multi-hop network. The
layout of each design was optimized to reduce the optical
critical path loss because laser power can be the majority
power consumer of a photonic interconnect. The area taken
by the on-chip interconnect was always less than 10%, and
the latency stayed roughly constant since the network topology
stayed the same. Interestingly, for the range of designs
explored, independent of the total number of cores, systems
with a modest number of dies (4-8) had the lowest optical
power.

6.  CONCLUSION 

Silicon photonics provides an appealing way to supply
the bandwidth needed to drive multi-socket systems, and
a range of scalable designs capable of supporting up to 1024
cores with uniform memory bandwidth was presented. In a
relatively flat network like the one presented, silicon photonics
sufficiently reduces the barrier to going off-chip such that
future die sizes may be chosen by what is most cost efficient
rather than what is most reasonable to manufacture.

Abstract

To fuel an increasing need for parallel performance, system designers have resorted to using multiple
sockets to provide more hardware parallelism. These multisocket systems have limited off-chip
bandwidth due to their electrical interconnect which is both power and pin limited. Current systems
often use a Non-Uniform Memory Architecture (NUMA) to get the most system memory
bandwidth from limited off-chip bandwidth. A NUMA system complicates the work of a performance
programmer or operating system, because they must maintain data locality to maintain
performance.

Silicon  photonics  is  an  emerging  technology  that  promises  great  off-chip  bandwidth  density 
and  energy  efficiency  when  compared  to  electrical  signaling.  With  this  abundance  of  bandwidth, 
it  will  be  possible  to  build  a  relatively  flat,  high  bandwidth  memory  interconnect.  Because  this 
interconnect  has  uniform  bandwidth,  NUMA  optimizations  will  be  unnecessary,  which  increases 
performance  programmer  productivity. 

If  the  penalties  to  making  a  multi-socket  system  are  negated  by  the  use  of  silicon  photonics, 
there  is  less  incentive  to  integrate,  and  economic  incentives  to  disintegrate.  In  this  thesis,  we 
present  this  scalable  and  coherent  multi-socket  design  along  with  discussing  the  tradeoffs  facing  an 
architect  when  incorporating  silicon  photonics  technology. 


Chapter  1 

Introduction 


Given  the  difficulties  of  scaling  uniprocessor  performance  further,  most  commercial  microprocessor 
manufacturers  have  instead  used  increased  transistor  densities  to  integrate  multiple  processor  cores 
on  one  die  [1].  These  manycore  systems  will  require  increasing  memory  bandwidth  at  reasonable 
energy  consumption  if  they  are  to  deliver  improvements  in  application  performance.  Otherwise 
these  systems  may  be  grossly  underutilized  [27]. 

When  the  desired  number  of  cores  cannot  fit  on  a  die  that  is  economical  to  manufacture,  they  are 
spread  across  multiple  sockets.  To  feed  many  cores  spread  across  multiple  sockets  will  require  even 
more memory bandwidth. Each socket will have its own attached DRAM, but in a shared memory
machine  it  must  be  made  accessible  to  the  other  sockets  within  the  system.  This  interconnect  must 
have  an  on-chip  portion  that  connects  all  of  the  cores  within  a  socket  in  addition  to  an  off-chip 
portion  that  connects  all  the  sockets  within  the  system. 

Current  multisocket  systems  often  have  their  off-chip  bandwidth  constrained  by  power  and  pin 
limitations [14, 18, 23]. As more cores are integrated into a die within a socket, they will need
even more bandwidth, and this bottleneck will become more troublesome as it is unlikely off-chip
electrical bandwidth will be able to keep up. The energy required to send a bit between sockets
is not scaling down very quickly because the sockets are not getting much closer physically, and
the materials used for traces are not getting significantly less resistive or capacitive. Even if off-chip
electrical  signaling  becomes  sufficiently  more  energy  efficient,  pin  bandwidth  could  become  the  next 
limiting  factor.  Off-chip  signaling  rates  and  die  sizes  are  not  growing  fast  enough  to  provide  enough 
pin  bandwidth  to  meet  the  growing  demand. 

A socket's limited off-chip bandwidth must be divided up between links to its own locally
attached DRAM and inter-socket links to reach remote DRAM attached to other sockets (Figure 1.1).
If all of the bandwidth is allocated to the locally attached DRAM, the system will have the
maximum memory bandwidth possible, but it will be disjoint. In contrast, if all of the bandwidth is
allocated to the inter-socket links, the system will have no memory bandwidth but great inter-core
bandwidth. If the two are balanced uniformly such that each socket receives an equal amount of
bandwidth from every part of memory (remote or local) the system will have a Uniform Memory
Architecture (UMA), and if they are balanced non-uniformly, the system will have a Non-Uniform
Memory Architecture (NUMA).

Systems  trying  to  get  the  most  system  memory  bandwidth  while  coping  with  off-chip  bandwidth 
scarcity will be pushed towards a NUMA design. This is true independent of the off-chip network
topology, because each inter-socket link occupies bandwidth at two sockets, while a link to DRAM
only occupies bandwidth at one socket. Any bandwidth taken away from the inter-socket links
can be turned into twice the bandwidth for the links to DRAM. This encourages system designers
to  skew  the  bandwidth  allocations  in  favor  of  locally  attached  DRAM  instead  of  reaching  other 
sockets,  to  maximize  system  memory  bandwidth. 
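
The two-for-one trade described above can be checked with a small sketch. The function name, the socket count, and the per-socket bandwidth figure below are illustrative assumptions, not values from this report; only the accounting rule (an inter-socket link consumes bandwidth at both of its endpoints) comes from the text.

```python
def system_bandwidth(num_sockets, offchip_bw_per_socket, dram_fraction):
    """Return (total DRAM bandwidth, total inter-socket link bandwidth).

    dram_fraction is the share of each socket's off-chip bandwidth that is
    wired to locally attached DRAM; the remainder goes to inter-socket links.
    """
    total_offchip = num_sockets * offchip_bw_per_socket
    dram_bw = total_offchip * dram_fraction
    # An inter-socket link occupies bandwidth at two sockets, so the usable
    # link bandwidth is half of what the sockets spend on it.
    link_bw = total_offchip * (1.0 - dram_fraction) / 2.0
    return dram_bw, link_bw


# Hypothetical system: 4 sockets with 100 GB/s of off-chip bandwidth each.
balanced = system_bandwidth(4, 100.0, 0.5)
skewed = system_bandwidth(4, 100.0, 0.75)
dram_gained = skewed[0] - balanced[0]
link_lost = balanced[1] - skewed[1]
print(dram_gained, link_lost)
```

In this hypothetical configuration, every unit of inter-socket link bandwidth given up yields two units of DRAM bandwidth, which is the pressure towards NUMA that the paragraph describes.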


Figure 1.1: Motivation for NUMA. a) Uniform Bandwidth b) Non-Uniform Bandwidth


A NUMA design imposes additional complexity on the performance programmer, as it is crucial
that data is co-located with the computation using it. This careful mapping is yet another
optimization performance programmers must consider [27], but if the memory system were flat (uniform)
it would be unnecessary, increasing their productivity. Some multiprogrammed workloads, such as
virtual machines running within a datacenter, will also benefit from the scheduling flexibility that
bandwidth uniformity provides. When scheduling jobs, a job could be run on the first available
core independent of where the data it needs resides. Furthermore, some workloads exhibit poor
spatial locality so it is difficult to spread the data across sockets effectively. If a new technology
provided an abundance of bandwidth, it would be worthwhile to allocate it uniformly to increase
programmer productivity and make the system more flexible.

In this work, we leverage silicon photonics to design high and uniform bandwidth multi-socket
memory interconnects. We present a general network design that can be used to make systems
of varying sizes, and to provide shared memory, which makes the system more usable, we discuss
how to reasonably implement coherency on top of the network. Because of the nature of the
design, it has much less incentive to integrate, which opens the door to chip disintegration for
cost savings. Overall, multi-socket interconnects are an interesting place to explore applications of
current research in silicon photonics because of its emphasis on off-chip communication.
Chapter 2

Photonic  Technology  Introduction 


Over the last few decades, the scale at which optical technology has been adopted for
communication has been steadily decreasing. Optical communication was first used for long distance
telecommunications, because its high endpoint costs were amortized over very long links. As
processing technologies have improved, the cost (delay, space, energy, dollars ...) of the endpoints
has decreased, which in turn has decreased the distance at which optical communication is
advantageous. Continued technology advances along with increased integration have enabled silicon
photonics, which decreases the feasible distance down to the inter-chip and even intra-chip level.

2.1  Technology  Overview 

In recent years, silicon photonics has been shown to be an increasingly desirable technology for
system interconnects because of its potential for higher bandwidth density, greater energy efficiency,
and lower latency. The technology is still immature with many competing implementation
proposals, so projected performance on these important metrics varies significantly. To ground the results
of  our  study,  we  select  a  particular  monolithically  integrated  silicon  photonics  technology  [4],  but 
the  overall  approach  should  be  applicable  to  the  other  current  proposals  because  much  of  it  is  based 
on  general  technology  insights. 

Figure 2.1 shows that a basic link comprises a light source, a modulator, a waveguide, and a
photodetector. The modulator encodes the signal by absorbing or not absorbing light as it passes
by it through the silicon waveguide. At the other end of the waveguide, the photodetector senses
the changes in light and decodes the signal. The electro-optical and opto-electrical conversions at
the endpoints introduce a latency and energy cost that needs to be amortized beyond a minimum
distance for photonics to be advantageous over electrical signaling.


Figure 2.1: An inter-socket photonic link (Die 1 to Die 2)
The selected technology provides Dense Wave Division Multiplexing (DWDM), which contributes
to its high bandwidth density (bits/second/μm). DWDM allows light of different wavelengths
to share the same waveguide with minimal interference, which allows multiple logical links to share
the same physical media without time multiplexing. This is enabled by putting rings which resonate
with a narrow frequency of light onto the waveguide, such that when the light resonates with a ring,
it is pulled off the waveguide into the ring. We can use these rings along with charge injection to
make a ring modulator [11, 20, 21]. Applying a charge to a ring shifts the ring's resonant frequency
so a particular wavelength can be absorbed or not absorbed to modulate the light.

A filter can also be made by using these resonant rings [21, 26], and the selected technology
uses two cascaded rings to get additional frequency selectivity (Double Ring Filter). Since the
photodetectors are sensitive to a wide range of light frequencies, a double ring filter is placed
between the photodetector and the waveguide so only the correct wavelength gets through the
filter and strikes the photodetector. These resonant rings are sensitive to a variety of environmental
factors and manufacturing variations, but these can be combated by thermally tuning the rings with
in-plane heaters.

The selected technology is monolithically integrated, and it utilizes a current CMOS
manufacturing process, which makes it much more realizable since it leverages a great deal of manufacturing
hardware investment and knowledge. Other photonic proposals may be better suited for
transmitting light, but they use materials or steps not currently part of a standard CMOS process, making
them more cost prohibitive to implement [3, 11, 15].

The light used by the system is generated by an off-chip laser because conventional CMOS
processes are poorly suited for laser fabrication. This light is brought on chip through a fiber
and then a coupler into the waveguide. On-chip light travels through poly-Si, which can be made
into a usable waveguide (Figure 2.2) by placing it on top of shallow trench isolation (STI) and
etching an air gap underneath it [10]. The air gap helps to improve the cladding on the bottom of
the waveguide, because the STI is too thin on its own. The air gap does take up silicon area, so
when possible multiple waveguides should share one to amortize the overhead. A great advantage
of photonics is that once the signal has been encoded optically, that light can be guided
through couplers and a fiber to another chip's waveguide without retransmission (Figure 2.1),
enabling links that operate seamlessly across long distances.


Figure 2.2: Cross section of an on-chip waveguide
2.2 Performance


Looking forward to when this silicon photonic proposal might be fully realizable, we compare it
against a projected optimally repeated electrical wire in a 22 nm process. Tables 2.1, 2.2, and 2.3
give a summary of the comparison. Based on preliminary results and device projections, the silicon
photonic proposal assumes a signaling rate of 10 Gbps (faster could be possible) and squeezes in 64
wavelengths per direction [21], meaning a single link (fiber or waveguide) has 80 GB/s of bandwidth
in each direction (64 x 10 Gbps = 640 Gbps).


Table 2.1: Approximate energy costs per bit

Quantity                              Electric (fJ)   Photonic (fJ)   Ratio
On-Chip Model                         -               150*            -
Off-Chip Model                        5000            150             -
Local On-Chip Wire (1 μm)             0.05            150             0.00033
Intermediate On-Chip Wire (1 mm)      50              150             0.33
Global On-Chip Wire (10 mm)           500             150             3.33
Off-Chip Trace (40 mm)                5000            150             33.33
Chip-to-Chip Link                     5500            150             36.67
  (40 mm off-chip, 10 mm on-chip)

Table 2.2: Approximate latency costs per bit

Quantity                              Electric (ps)   Photonic (ps)    Ratio
On-Chip Model                         100 ℓ/mm        200 + 10 ℓ/mm    -
Off-Chip Model                        50 + 5 ℓ/mm     200 + 5 ℓ/mm     -
Local On-Chip Wire (1 μm)             0.1             200.01           0.0005
Intermediate On-Chip Wire (1 mm)      100             210              0.48
Global On-Chip Wire (10 mm)           1000            300              3.33
Off-Chip Trace (40 mm)                250             400              0.63
Chip-to-Chip Link                     1250            700              1.79
  (40 mm off-chip, 10 mm on-chip)

Table 2.3: Approximate bandwidth densities per bit. Photonic values sum the bandwidth of both
directions

             Electric (Gb/s/μm)   Photonic (Gb/s/μm)   Ratio
On-Chip      5                    320                  64.0
Off-Chip     0.2                  26                   130.0

* 100 fJ (modulator) + 50 fJ (receiver) + power to thermally tune rings + optical power
2.2.1 Power


Energy efficiency (energy per bit), especially off-chip, has been listed as one of the strongest
advantages of the selected photonic technology. It is important to fully explore the three ways it expends
power:

•  Encoding/Decoding power is consumed at the endpoints and it includes electrical circuits to
serialize/deserialize the signal from the native system clock to the transmission rate as well
as  the  power  consumed  by  charge  injection  to  modulate  the  signal.  This  power  is  insensitive 
to  distance,  is  mostly  dynamic,  and  the  values  quoted  in  Table  2.1  are  for  100%  utilization. 

•  Light  Generation  power  is  burned  by  the  laser  to  produce  the  light  used  for  communication. 
This  power  is  constant,  independent  of  utilization.  It  is  difficult  to  dynamically  adjust  laser 
power.  To  generate  laser  light  more  efficiently,  the  same  laser  is  used  for  multiple  links,  so 
unless  all  of  the  links  are  inactive,  it  is  hard  to  scale  back.  It  is  important  to  note  that  the 
light  generation  power  is  the  amount  of  electrical  power  required  to  produce  the  laser  power 
(light  intensity)  the  system  needs.  Light  generation  power  is  often  overlooked,  and  most  of 
the  prior  work  has  not  added  it  to  the  power  total  with  the  justification  that  it  is  off  chip 
and  thus  does  not  contribute  to  power  density  hotspots  on  the  processor  [25].  Keeping  with 
convention,  for  most  of  this  work  laser  power  will  be  presented  separately,  because  laser  light 
generation is an orthogonal area of research, so converting it to electrical power might be
misleading.  However,  when  calculating  the  total  power  for  a  system,  a  conservative  estimate 
of  future  laser  efficiency  of  25%  is  used.  This  power  is  strongly  dependent  on  how  much  loss 
the  path  has,  and  Section  2.3  will  present  more  details  about  this. 

•  Thermal Tuning power is burned up by heaters to control the ring's resonant frequency for
process variation. The observed sensitivity is 1 μW/ring/K and the needed control range is
20 K, so each ring will burn 20 μW.
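
As a rough illustration of two of the power terms above, the sketch below computes thermal tuning power from the 1 μW/ring/K sensitivity and 20 K control range, and converts a required optical laser power into electrical power using the 25% laser efficiency estimate. The function names, the ring count, and the optical power figure are hypothetical and are not taken from this report.

```python
def thermal_tuning_power_uw(num_rings, sensitivity_uw_per_ring_k=1.0,
                            control_range_k=20.0):
    """Heater power in microwatts needed to tune num_rings resonant rings."""
    return num_rings * sensitivity_uw_per_ring_k * control_range_k


def laser_electrical_power_w(optical_power_w, laser_efficiency=0.25):
    """Electrical power the laser draws to emit optical_power_w of light."""
    return optical_power_w / laser_efficiency


# Hypothetical design point: 10,000 rings and 2 W of required optical power.
tuning_w = thermal_tuning_power_uw(10_000) * 1e-6
laser_w = laser_electrical_power_w(2.0)
print(f"thermal tuning: {tuning_w:.2f} W, laser (electrical): {laser_w:.1f} W")
```

Both terms are independent of link utilization, which is why they are treated as fixed overheads in the discussion above.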

In summary, using a silicon photonic link purely on-chip will not be significantly advantageous
with regards to energy, unless it travels a substantial distance (> 3 mm), however off-chip it could
be more than an order of magnitude more efficient.

2.2.2  Latency 

Most  of  the  latency  for  a  silicon  photonic  link  is  at  the  endpoints,  since  light  propagates  rapidly. 
The  endpoint  latency  is  a  consequence  of  serializing  and  deserializing  the  data  from  the  native  clock 
rate  to  the  transmission  rate  of  10  Gbps.  Table  2.2  shows  that  photonics  only  has  lower  latency 
than  electrical  beyond  2.2mm  on-chip.  As  mentioned  earlier,  the  photonic  links  can  go  inter-chip 
without retransmission, so in those cases the latency gap between electric and photonic is further
reduced. 
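
The on-chip crossover distances can be recomputed from the linear models in Tables 2.1 and 2.2. The sketch below is only a consistency check: the helper name is hypothetical, and the 50 fJ/mm electrical energy slope is an assumption inferred from the wire rows of Table 2.1 (0.05 fJ at 1 μm, 50 fJ at 1 mm, 500 fJ at 10 mm), since the table does not state it as a model.

```python
def break_even_mm(fixed_a, per_mm_a, fixed_b, per_mm_b):
    """Distance d (mm) where fixed_a + per_mm_a*d equals fixed_b + per_mm_b*d."""
    return (fixed_b - fixed_a) / (per_mm_a - per_mm_b)


# Latency (ps): electrical 100 per mm, photonic 200 + 10 per mm (Table 2.2).
latency_break_even = break_even_mm(0.0, 100.0, 200.0, 10.0)

# Energy (fJ): electrical 50 per mm (inferred), photonic flat 150 (Table 2.1).
energy_break_even = break_even_mm(0.0, 50.0, 150.0, 0.0)

print(f"latency crossover: {latency_break_even:.2f} mm")
print(f"energy crossover:  {energy_break_even:.2f} mm")
```

Under these models the latency crossover is about 2.2 mm, matching the figure quoted in the text, and the energy crossover falls at 3 mm.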


2.2.3  Area 

On-chip waveguides are larger than wires and they have a wider pitch. The air gaps make the
waveguides effectively wider because no circuits can be placed over them. Even though waveguides
take up more area than wires, there is so much more bandwidth per waveguide from DWDM and
bidirectional communication that it still obtains a large bandwidth density advantage (Table 2.3).
Off-chip this advantage becomes more significant because they have comparable pitches, with the
same data rates, but a single fiber contains 64 links in each direction while an electrical pin only
implements a single link in one direction.

2.3  Laser  Power 

Every optical component introduces some amount of loss to the signal, increasing the laser power
needed to ensure sufficient light reaches every photodetector. As mentioned in Section 2.2.1, light
generation power is significant, and it is directly proportional to laser power. We define the optical critical
path as the path with the greatest loss between the light source and the last photodetector. Along
the optical critical path, the laser power required to overcome losses tends to grow exponentially
rather than linearly, so a reasonable design can quickly become unreasonable when scaled up. The
network layout and size can contribute greatly to loss, so careful physical layout design is essential
to save power.

Using Figure 2.3, we can trace out an example optical critical path and show where the losses
come from. Table 2.4 is included to give a sense of the relative magnitudes, since the absolute values
could change as the technology matures. The optical critical path starts at the laser, and ends
at the last photodetector (the one attached to the filter for the green wavelength). Traveling any
distance, the light experiences some loss, which is negligible for off-chip fibers and significant for
on-chip waveguides. To go from off-chip to on-chip or vice versa, the light travels through a
coupler, which incurs loss substantial enough that links which span more than two chips may be
untenable. Once the light has been brought on-chip, it typically is fanned out through splitters
to make all of the needed links. When the waveguide crosses another waveguide, it also incurs
loss because all waveguides are routed in the same plane with this technology. Crossing losses can
be significant, because often multiple waveguides are routed parallel to each other, so a crossing
actually results in many crossings.

There  are  a  variety  of  losses  caused  by  the  resonant  rings.  When  light  passes  by  a  filter  tuned  for 
another  wavelength,  it  experiences  through  loss  (Filter  to  through  node).  When  it  passes  through 
the  intended  filter  and  reaches  the  photodetector,  it  experiences  drop  loss  (Filter  to  drop  node). 
Modulator  insertion  loss  is  incurred  when  a  wavelength  of  light  passes  by  a  modulator  tuned  for 
that  frequency  that  is  currently  inactive. 

Another important consideration is the non-linearity limit imposed by the Poly-Si waveguide.
As the combined power of the light inside a waveguide grows, there is a non-linear increase in
the amount of light that escapes. To combat that loss, more laser power is used, which results
in even more loss, so it is best to keep the total power for a waveguide within reasonable limits.
Normally how many wavelengths can be put into a waveguide is set by the frequency selectivity
of the photonic components used, but the number of wavelengths used per waveguide may also be
set by the path loss which determines the power required per wavelength and thus the number of
wavelengths that can fit under the non-linearity limit. The designs presented later in this study
were made to have low loss, and they should be able to carry 64 wavelengths per direction without
issue.


Table 2.4: Optical Power Costs [4]

Component                    Loss (dB)
Coupler                      1.0
Splitter                     0.2
Non-Linearity                1.0
Filter (to through node)     0.01
Modulator Insertion          0.5
Waveguide Crossing           0.05
Waveguide (per cm)           1.0
Optical Fiber (per cm)       0.000005
Filter (to drop node)        1.5
Photodetector                0.1
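
A loss budget along an optical critical path can be sketched by summing the per-component losses of Table 2.4 in dB and converting the total into a laser power multiplier. The component counts in EXAMPLE_PATH are purely hypothetical (they do not describe Figure 2.3 or any design in this report), and the function names are illustrative.

```python
# Per-component losses in dB, taken from Table 2.4.
LOSS_DB = {
    "coupler": 1.0,
    "splitter": 0.2,
    "non_linearity": 1.0,
    "filter_through": 0.01,
    "modulator_insertion": 0.5,
    "waveguide_crossing": 0.05,
    "waveguide_per_cm": 1.0,
    "fiber_per_cm": 0.000005,
    "filter_drop": 1.5,
    "photodetector": 0.1,
}


def path_loss_db(component_counts):
    """Total loss in dB; losses in dB add along the path."""
    return sum(LOSS_DB[name] * count for name, count in component_counts.items())


def required_laser_power(detector_power, loss_db):
    """Laser power needed so that detector_power survives loss_db of loss."""
    return detector_power * 10 ** (loss_db / 10.0)


# Hypothetical path: counts chosen only to exercise every table entry.
EXAMPLE_PATH = {
    "coupler": 3, "splitter": 4, "non_linearity": 1, "filter_through": 63,
    "modulator_insertion": 1, "waveguide_crossing": 20, "waveguide_per_cm": 2,
    "filter_drop": 1, "photodetector": 1,
}

loss = path_loss_db(EXAMPLE_PATH)
print(f"path loss: {loss:.2f} dB, power multiplier: {required_laser_power(1.0, loss):.1f}x")
```

Because the dB losses add while the required power scales as 10^(loss/10), every additional 10 dB on the critical path multiplies the laser power by ten, which is the exponential growth noted in Section 2.3.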
2.4 Design Implications


As  shown  by  Tables  2.1  &  2.3,  the  selected  photonics  technology  can  provide  a  tremendous  amount 
of  off-chip  bandwidth,  because  of  its  energy  efficiency  and  bandwidth  density  advantages.  Replacing 
the  electrical  inter-socket  links  with  photonic  ones  will  enable  much  more  bandwidth  to  each  socket. 
Used  in  conjunction  with  an  electrical  on-chip  network,  it  could  still  result  in  dramatically  higher 
system  bandwidth. 

Even though entirely on-chip photonic links do not hold much of an advantage over electrical
on-chip links, if photonics is used for the off-chip network, it makes sense to continue seamlessly
on-chip because the conversion costs will have already been paid. By using these seamless links,
off-chip networks and on-chip networks are flattened into one domain. To get the most from this
flat network will require co-design of the on-chip and off-chip networks.

In  this  thesis,  the  connection  between  a  memory  controller  and  a  DRAM  module  is  assumed 
to  be  electrical.  Future  work  could  investigate  a  photonic  link  between  a  memory  controller  and 
DRAM,  and  doing  so  should  not  change  the  results  of  this  study. 
Chapter 3

Design of a Photonic Multisocket System


Chapter 2 shows that silicon photonics has great potential, and in this chapter we present a network
designed to take full advantage of it. When designing a system known to be multi-socket, it is
important  to  consider  the  off-chip  network  in  addition  to  the  on-chip  network,  and  co-designing  the 
on-chip  and  off-chip  networks  makes  best  use  of  seamless  photonic  links. 

3.1  System  Assumptions 

To  provide  structure  for  the  rest  of  this  study,  we  make  some  assumptions  about  the  target  system. 
There  are  a  variety  of  architectures  that  could  take  advantage  of  the  transistor  gains  from  Moore’s 
law, but to achieve high computational throughput on a workload without high arithmetic intensity,
they will all require high memory bandwidth. For this work, we envision a system composed of
many simple in-order cores, but some of the higher level results should still be applicable to other
architectures.

To ground our design with real numbers (Table 3.1), we assume that in a 22 nm process with 400 mm²
of silicon, it will be possible to fit 256 cores running at 2.5 GHz [4]. Each of these cores will include
4-way SIMD with Fused Multiply Accumulate (FMAC), giving the system a total of 5 TFLOPS
of  peak  performance.  The  amount  of  memory  bandwidth  needed  to  adequately  supply  this  system 
will  depend  on  the  arithmetic  intensity  of  the  target  workload,  but  the  frequently  desired  ratio  of  one 
byte  of  memory  bandwidth  per  one  flop  will  support  many  desired  workloads,  which  will  equate  to 
5  TBps  of  memory  bandwidth  for  the  system  [27].  This  bandwidth  will  be  supplied  by  16  memory 
controllers,  and  each  of  these  memory  controllers  may  be  attached  to  multiple  physical  DRAM 
channels,  but  from  the  point  of  view  of  the  rest  of  the  system,  each  memory  controller  is  a  single 
endpoint  of  arbitration  and  contention.  We  also  assume  that  this  system  will  be  implemented  over 
four  sockets,  so  each  socket  will  have  one  quarter  of  the  cores  and  memory  controllers.  We  assume 
a shared-memory system, where photonics is used to connect processors to memory controllers, not
cores  to  cores. 
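
The headline numbers above can be reproduced with a few lines of arithmetic. The sketch below assumes that one FMAC counts as two floating-point operations per SIMD lane per cycle (a common counting convention, not stated in this report) and uses hypothetical function names. The result is slightly above the rounded 5 TFLOPS quoted in the text.

```python
def peak_flops(cores, clock_hz, simd_width, flops_per_fmac=2):
    """Peak floating-point rate, counting one FMAC as two flops (assumption)."""
    return cores * clock_hz * simd_width * flops_per_fmac


def bandwidth_per_controller(total_bw_bytes_per_s, num_controllers):
    """Memory bandwidth each controller must supply if load is spread evenly."""
    return total_bw_bytes_per_s / num_controllers


peak = peak_flops(cores=256, clock_hz=2.5e9, simd_width=4)
# One byte per flop on the rounded 5 TFLOPS figure gives 5 TBps.
total_bw = 5e12
per_controller = bandwidth_per_controller(total_bw, 16)
per_socket = total_bw / 4

print(f"peak: {peak / 1e12:.2f} TFLOPS")
print(f"per controller: {per_controller / 1e9:.1f} GB/s")
print(f"per socket: {per_socket / 1e12:.2f} TBps")
```

Each of the 16 memory controllers therefore has to sustain roughly 312.5 GB/s, and each of the four sockets 1.25 TBps, in line with Table 3.1.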
                      Baseline Socket   Max Configuration
Sockets               1                 4
Cores                 64                256
Clock Rate            2.5 GHz           2.5 GHz
Total Silicon Area    100 mm²           400 mm²
Memory Bandwidth      1.25 TBps         5 TBps
Memory Controllers    4                 16

Table 3.1: Target system assumptions


3.2  Topology  Insights 

A  network  designer  must  balance  the  needs  of  the  target  workload  with  what  the  technology  allows. 
The assumed workload for this system will need high bandwidth to feed many functional units,
but this bandwidth must be provided uniformly (equally by all memory controllers) to simplify
programming and to increase portability. Memory latency must be kept moderately low since the
cores  are  mostly  scalar,  so  they  are  incapable  of  cheaply  tolerating  too  much  memory  latency.  By 
Little’s  Law,  the  amount  of  data  in  flight  is  proportional  to  the  product  of  latency  and  bandwidth. 
If  the  memory  latency  is  increased,  additional  area  will  need  to  be  dedicated  to  holding  and  tracking 
the  increased  amount  of  data  in  flight,  which  will  make  the  simple  cores  more  expensive. 
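
Little's Law makes the cost of extra latency concrete. In the sketch below, the 5 TBps bandwidth is the target from Section 3.1, while the 100 ns and 200 ns latencies and the 64-byte request size are hypothetical values chosen only for illustration.

```python
def bytes_in_flight(bandwidth_bytes_per_s, latency_s):
    """Little's Law: data in flight = bandwidth x latency."""
    return bandwidth_bytes_per_s * latency_s


def outstanding_requests(bandwidth_bytes_per_s, latency_s, request_bytes):
    """Number of requests that must be in flight to sustain the bandwidth."""
    return bytes_in_flight(bandwidth_bytes_per_s, latency_s) / request_bytes


SYSTEM_BW = 5e12          # bytes/s, target system memory bandwidth
REQUEST_BYTES = 64        # assumed request (cache line) size

for latency_ns in (100, 200):  # hypothetical memory latencies
    in_flight = bytes_in_flight(SYSTEM_BW, latency_ns * 1e-9)
    requests = outstanding_requests(SYSTEM_BW, latency_ns * 1e-9, REQUEST_BYTES)
    print(f"{latency_ns} ns: {in_flight / 1e3:.0f} KB in flight, "
          f"{requests:.0f} outstanding requests")
```

Doubling the latency doubles the state the cores must hold and track, which is the area cost the paragraph above refers to.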

A  low-diameter,  high-radix  network  will  achieve  these  goals,  and  it  will  map  well  to  the  selected 
silicon  photonics  technology  proposal.  Low-diameter  networks  are  known  for  low  latency  due  to 
their  low  hop  count,  as  well  as  having  more  uniform  latency  because  there  is  less  variance  in  path 
length [8]. This low hop count also results in more uniform bandwidth because there are fewer hops
for  links  to  get  congested  by  other  traffic  on  the  network.  To  reach  the  same  number  of  endpoints, 
a  lower-diameter  network  must  compensate  with  a  higher  radix.  With  a  constant  bandwidth  per 
endpoint,  increases  in  radix  result  in  decreased  bandwidth  per  link,  which  can  be  problematic  as  it 
will  increase  the  serialization  latency. 

A common challenge with implementing low-diameter, high-radix networks in electrical
technologies is that the links tend to become longer, and as a consequence, consume a significant amount
of power. The selected photonic technology, however, is mostly distance insensitive with respect to
latency and power. Another challenge with implementing these global links is that when mapped
to a physical substrate, the bisection bandwidth required is high. This can be troublesome to route
off-chip, but fortunately the selected photonic technology provides great off-chip bandwidth
density. In contrast, if this network was implemented electrically, the bisection bandwidth would be
constrained by the electrical pins, limiting the total network bandwidth. This would encourage the
network designer to use a higher-diameter, lower-radix network to reduce the demand for bisection
bandwidth which will also reduce the demand for off-chip bandwidth, at the price of longer and
less uniform latencies and less uniform bandwidths.

Our design takes the low-diameter, high-radix network to the extreme, by using a simple
fully-connected network (Figure 3.1a) as a starting point. Each network endpoint (core or memory
controller) will have a high-radix switch with a photonic link for every possible endpoint. A single
photonic hop minimizes latency while maximizing bandwidth uniformity. A one-hop topology will
become a limiting factor as the design is scaled up to higher numbers of endpoints, since it will also
increase the radix. Increasing the radix will hurt performance because the serialization latency will
grow as the links get narrower, and the power and area for the electrical switch will grow as its
radix does. For the intended design scale of a single board, compelling systems might be possible
utilizing the selected photonic technology without taking up an unreasonable amount of area or
power.
Figure 3.1: Topological benefits of concentration. a: Fully-connected network b: Fully-connected
network with core concentration c: Fully-connected network with core concentration done by a
shared cache


3.2.1  Concentration 

Taking the simple initial design of a fully connected topology between individual cores and memory
controllers and scaling it to meet the target system parameters will result in poor performance.
The effective radix is high because there are so many memory controllers and cores, making each
core-memory controller link so narrow (for the target system: 1/16 of a core's bandwidth) that the
serialization latency is significant. It is also statistically harder for a simple core to have enough
memory request parallelism to keep all of those links busy simultaneously, leaving many of them
underutilized. Low utilization is worrisome because static power constitutes a large fraction of a
photonic link's power, but this can be avoided by using concentration to share links to increase
utilization [8].

By grouping cores into clusters (Figure 3.1b), concentration widens the links to the memory
controllers, which drastically cuts down on serialization latency. Since each cluster contains multiple
cores, within a cluster it is statistically easier to generate enough memory request parallelism
to obtain higher utilization. Concentration combines the switches and links at the core side of
the network to reduce the effective radix of the network. This has the desired effect of improving
serialization latency, but it could also be used to build larger networks with the same serialization
latency.
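
The effect of concentration on serialization latency can be estimated from the target system parameters (5 TBps total, 256 cores, 16 memory controllers). The sketch below assumes bandwidth is split evenly over all cluster-to-controller links and that a message is 64 bytes; both the message size and the function names are illustrative assumptions.

```python
def link_bandwidth(total_bw, num_cores, cores_per_cluster, num_mem_controllers):
    """Bandwidth of one cluster-to-memory-controller link (bytes/s)."""
    num_clusters = num_cores // cores_per_cluster
    return total_bw / (num_clusters * num_mem_controllers)


def serialization_latency_ns(message_bytes, link_bw_bytes_per_s):
    """Time to push one message onto a link of the given bandwidth."""
    return message_bytes / link_bw_bytes_per_s * 1e9


TOTAL_BW = 5e12      # bytes/s
MESSAGE_BYTES = 64   # assumed message size

for cores_per_cluster in (1, 8):
    bw = link_bandwidth(TOTAL_BW, 256, cores_per_cluster, 16)
    latency = serialization_latency_ns(MESSAGE_BYTES, bw)
    print(f"{cores_per_cluster} core(s) per cluster: "
          f"{bw / 1e9:.2f} GB/s per link, {latency:.1f} ns to serialize")
```

Under these assumptions, grouping eight cores per cluster makes each link eight times wider and cuts the serialization latency by the same factor.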

Since the cores within a cluster will be physically near each other as they share the same
photonic cluster-memory links, they could also share their last level cache (Figure 3.1c). There are
architectural benefits of sharing a cache, and current caches have been built with 8-way sharing
[17]. These short links between cores and caches, and between caches and the local switch, should be
electrical, since the distance is too short for photonics to be advantageous. For the rest of our designs we
assume 8 core clusters, which obtains the benefits of concentration without overly burdening the
cluster interconnect, but clusters of 2-16 cores should also be feasible.


3.2.2  Off-Chip  Connections 

With multi-socket systems it is desirable if the same chip can be used by simply varying the number
connected together (even if only powers of two), because it will increase the volume of that part,
lowering its cost. This scalable reusability is difficult to obtain while providing the goal of
uniform memory bandwidth. As shown in Figure 3.2a, if the connections between clusters and memory
controllers are made on-chip, that bandwidth is fixed because we want to reuse the same chip in all
systems. Using that chip to build systems with a variable number of sockets populated will require
some bandwidth (on-chip or off-chip) be turned off to keep the bandwidth allocation between the
memory controllers on-chip and the memory controllers in other sockets uniform. If every connection
is made off-chip (Figure 3.2b), the bandwidth allocations can be changed off-chip without modifying
the chip.


Figure 3.2: Topological benefits of all connections off-chip. a) On-Chip & Off-Chip (Chip 1,
Chip 2); b) Off-Chip Only (Chip 1, Chip 2)


To implement this, each cluster will have enough links to support the maximum number of memory
controllers in the largest possible system, and the memory controllers will have enough links to
support the maximum number of clusters in the largest possible system. In a fully populated system,
all of these links will be connected one to one. If the system has only half of its sockets
populated, there will be two links between each cluster and each memory controller. These links
could be ganged together to make a single logical link of twice the bandwidth, or they could be kept
separate to allow for greater memory request parallelism. In the case of a single socket system, the
off-chip fibers are looped back.
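
The link provisioning rule can be illustrated with a minimal sketch. It assumes that every socket
carries the same number of memory controllers and that the populated socket count divides the
maximum evenly (powers of two, as suggested earlier); the function name is hypothetical.

```python
def links_per_pair(max_sockets: int, populated_sockets: int) -> int:
    """Physical links between each cluster and each memory controller.

    Each cluster is provisioned with one link per memory controller of the
    largest system, so populating fewer sockets leaves spare links that
    double up on the memory controllers that are present.
    """
    if not 1 <= populated_sockets <= max_sockets:
        raise ValueError("populated_sockets must be between 1 and max_sockets")
    if max_sockets % populated_sockets != 0:
        raise ValueError("populated_sockets must divide max_sockets evenly")
    return max_sockets // populated_sockets


# Quad-socket design: fully populated, half populated, single socket.
for populated in (4, 2, 1):
    print(populated, "sockets ->", links_per_pair(4, populated), "link(s) per pair")
```

In the single socket case every link returned by the sketch is a looped-back fiber.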

It might seem that routing all traffic off-chip is wasteful when some of it could be done purely
on-chip, but with photonics this penalty is greatly reduced. Most of the latency and on-chip energy
cost of a photonic link is at the endpoints, so whether the link is purely on-chip or not only
affects optical power. Depending on what the optical critical path loss is, this change in optical
power may be truly negligible. This is in contrast to electrical off-chip links, which consume
sufficiently more energy and area such that an efficient design will never send data off-chip unless
forced. Taking advantage of the off-chip bandwidth density, energy efficiency, and distance
insensitivity of photonic links, for the flexibility it provides and for the uniformity it
maintains, the benefits of making all connections off-chip outweigh the small light generation power
increase.

3.3  Packaging 

To package the topology into a physical design will require more innovation. Because all the
cluster-memory controller connections are off-chip, each chip will have two types of fibers: those
originating at clusters and those originating at memory controllers. Somehow off-chip, all of these
fibers must be appropriately attached. To keep the fibers more organized, they can be grouped into
ribbons, which simplifies assembly. As the number of sockets in the system grows, the number of
ribbons that must be attached could become unreasonable, because the topology is fully connected,
so each socket must have a ribbon to every socket (including itself). Figure 3.3a shows this for the
four socket case.


Figure 3.3: Comparison of with/without star fiber coupler. a) Without Star Fiber Coupler; b) With
Star Fiber Coupler


A star fiber coupler provides the needed all-to-all connectivity while greatly simplifying the fiber
routing (Figure 3.3b). The star fiber coupler acts as a hub chip, so independent of system size,
each socket only needs to attach two ribbons (one from its clusters and one from its memory
controllers) to the coupler, and it will create the all-to-all connections. As shown in Figure 3.4,
all of the cluster ribbons attach to one side of the coupler, and all of the memory controller
ribbons attach to the other side. The ribbons from both sides come in orthogonal to each other so
each ribbon crosses every other ribbon. In the example shown, four ribbons of four fibers come in
each side, so effectively it is as if there is a fiber between every socket including itself (one
fiber gets looped back).
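
The connectivity the coupler provides can be modeled as a transpose of ribbon and fiber indices.
The sketch below is illustrative only: the indexing convention (fiber d of socket s's cluster
ribbon lands on fiber s of socket d's memory controller ribbon) is an assumption, not something
the physical device description specifies.

```python
def star_coupler(num_sockets: int) -> dict[tuple[int, int], tuple[int, int]]:
    """Map (cluster ribbon, fiber) to (memory controller ribbon, fiber).

    Ribbons are indexed by socket. Crossing every cluster ribbon with every
    memory controller ribbon amounts to swapping the two indices.
    """
    return {
        (src, dst): (dst, src)
        for src in range(num_sockets)
        for dst in range(num_sockets)
    }


coupler = star_coupler(4)
# One fiber per ordered socket pair; the (s, s) entries are the loopback fibers.
loopbacks = [key for key, value in coupler.items() if key == value]
print(len(coupler), "fibers,", len(loopbacks), "looped back")
```

With four ribbons of four fibers on each side, the mapping covers all sixteen socket pairs, four
of which are a socket connected to itself.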

The star fiber coupler can be generalized to support cases when there are more fibers than sockets
or when multiple fibers are destined for each socket. It is a completely passive device, whose only
purpose is to precisely hold ribbons such that their fibers can be efficiently coupled. The star
fiber coupler should be comparatively inexpensive, and along with some of the ribbons, it is the
only thing that changes between different system sizes.

To lay the system out on a board, the compute dies that contain the clusters and memory controllers
are placed around the star fiber coupler as shown in Figure 3.5a. Each of the compute dies is
surrounded by its own locally attached DRAM to reduce the distance for the electrical links between
them. The memory controllers are evenly spaced around the edge of the die to provide the easiest
exposure for wiring to DRAM. By using photonics for inter-socket links, only a small amount of area
needs to be dedicated to the fibers, leaving the rest of the pin area for connecting to DRAM or
attaching to power and ground. The ribbons are attached only at the endpoints by vertical couplers
and the ribbons will float freely beneath the board (Figure 3.5b), so they can avoid the heat sinks
of the compute dies. A more dense board layout might reduce ribbon lengths, but it could
significantly complicate the much more costly electrical signaling to DRAM or increase the power
density. Extra distance in the ribbon is tolerable since the additional optical power loss and the
increase in delay are both negligible.

Figure 3.4: Schematic of star fiber coupler

Figure 3.5: System layout

3.4  Die  Layout 

The layout of the photonic components on-chip is crucial because it can greatly affect the optical
power. Without careful design, the loss along the optical critical path quickly becomes so great
that the laser power becomes unreasonable. Essentially the designer's job is to take all of the
logical links, map those to wavelengths, and then map those to appropriate waveguides. The following
sections highlight the optimizations used to make an efficient layout, such as the 64 core die
layout in Figure 3.6.


Figure 3.6: Die layout for 64 core die designed to support a 256 core system

3.4.1 Nested-U Waveguide Layout

Laying out the waveguides in a nested-U configuration as shown in Figure 3.6 can help combat two
sometimes avoidable sources of optical loss: waveguide length and crossing loss. By bringing the
power fiber in on one side and the inter-socket ribbons on the other, the waveguide distance is
minimized while still reaching all the needed endpoints. The nested-U layout guarantees that the
waveguide distance is less than or equal to the length of the chip plus the width of the chip. A
single crossing doesn't contribute too much loss, but quite often waveguides are routed in parallel
so a crossing will intersect multiple waveguides and then the losses multiply. Nesting the
waveguides removes all crossings, since they always go around each other.
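
To see why crossings matter, a rough loss budget can be sketched as follows. The per-centimeter
and per-crossing losses and the 1 cm by 1 cm die are placeholder assumptions chosen for
illustration, not the device parameters used in this work. Losses multiply in linear terms, which
is the same as adding in decibels.

```python
def path_loss_db(length_cm: float, crossings: int,
                 loss_db_per_cm: float, loss_db_per_crossing: float) -> float:
    """Total loss along one waveguide path, in dB (dB losses add)."""
    return length_cm * loss_db_per_cm + crossings * loss_db_per_crossing


def laser_power_multiplier(loss_db: float) -> float:
    """Factor by which input optical power must grow to overcome the loss."""
    return 10 ** (loss_db / 10)


# Hypothetical device parameters, for illustration only.
LOSS_DB_PER_CM = 1.0
LOSS_DB_PER_CROSSING = 0.05

# Nested-U bound on a square 1 cm x 1 cm die: length + width = 2 cm, no crossings.
nested = path_loss_db(2.0, 0, LOSS_DB_PER_CM, LOSS_DB_PER_CROSSING)
# Same length, but the path cuts across a bundle of 64 parallel waveguides.
crossed = path_loss_db(2.0, 64, LOSS_DB_PER_CM, LOSS_DB_PER_CROSSING)

print(f"nested-U: {nested:.2f} dB, x{laser_power_multiplier(nested):.2f} laser power")
print(f"crossing: {crossed:.2f} dB, x{laser_power_multiplier(crossed):.2f} laser power")
```

Because the laser power grows exponentially with the loss in dB, even a small per-crossing loss
becomes significant once a path crosses a wide bundle.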

3.4.2  Cluster  Striping  Across  Waveguides 

With  the  nested-U  waveguide  layout,  a  waveguide  from  the  power  fiber  to  the  inter-socket  ribbon 
actually  passes  by  more  than  one  cluster.  To  load  a  waveguide  with  wavelengths  from  only  one 
cluster  exclusively  is  wasteful,  because  later  on  those  wavelengths  will  need  to  be  mixed  for  the 
inter-socket  fibers.  Striping  a  cluster’s  wavelengths  across  all  the  waveguides  that  pass  by  reduces 
the  need  to  mix  wavelengths  later  on. 

In the example in Figure 3.6, eight waveguides pass four clusters. If each cluster put all of its
wavelengths on two waveguides, somehow the wavelengths will need to be shuffled around such that
they map appropriately to the four fibers that go between each socket. A device like the one
presented in Section 4.2.1 could accomplish the needed mixing, but with striping it is often
unnecessary.


3.5  Evaluation 

As mentioned in Section 3.1, the network presented is designed to support 4 sockets of 64 cores, for
a total of 256 cores. As an early evaluation of feasibility, we analyze its interconnect performance
using conservative overestimates (Table 3.2). The system at theoretical peak can provide each core
with the desired 1 byte per FLOP, for a total of 5 TBps of memory bandwidth.


Table 3.2: Overestimates for Quad Socket Interconnect for 256 Cores Total

Quantity             Value
Total Power          9.77 W
Latency              1 ns
Area (per socket)    4.2 mm^2

Figure  3.7  shows  a  breakdown  of  where  the  power  is  consumed  in  the  interconnect.  For  our 
analysis  we  use  the  impractical  100%  utilization  to  show  what  the  peak  power  could  be.  With 
0%  utilization,  the  encoding/decoding  power  will  scale  down  to  about  half  of  what  it  is  at  peak 
utilization,  but  the  rest  of  the  interconnect  power  is  static  and  will  not  change  based  on  activity. 
The  encoding/decoding  power  is  directly  related  to  the  number  of  photonic  endpoints,  and  with 
a  constant  number  of  cores  it  will  scale  directly  with  the  amount  of  offered  bandwidth  per  core. 
Light generation power is burned in off chip lasers so it adds to the system wall power but not
to the compute die's power density. For comparison, we converted laser power to light generation
power assuming a conservative laser efficiency of 25%. Thermal tuning power is set by the number
of rings, which is also directly proportional to the number of photonic endpoints. This power is
continuously burned, but it is not a large overall contributor (Figure 3.7).

Figure 3.7: Breakdown of network power consumption for 256 core system with 64 cores per die
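
The split between static and activity-dependent power can be sketched numerically. The sketch
takes the 9.77 W peak total from Table 3.2 and the 24 mW per core encoding/decoding figure quoted
in Section 4.3.1, and assumes that encoding/decoding power scales linearly between half of peak
(idle) and peak (fully utilized). The linear interpolation is an assumption made here for
illustration; only the two endpoints come from the text.

```python
CORES = 256
TOTAL_PEAK_W = 9.77                 # Table 3.2, at 100% utilization
ENCODE_PEAK_W = 0.024 * CORES       # 24 mW per core (Section 4.3.1)
STATIC_W = TOTAL_PEAK_W - ENCODE_PEAK_W  # laser, thermal tuning, other static power


def interconnect_power_w(utilization: float) -> float:
    """Estimated interconnect power for a utilization between 0.0 and 1.0."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be within [0, 1]")
    # Assumed linear scaling from half of peak at idle up to peak.
    encode_w = ENCODE_PEAK_W * (0.5 + 0.5 * utilization)
    return STATIC_W + encode_w


for u in (0.0, 0.5, 1.0):
    print(f"utilization {u:.0%}: {interconnect_power_w(u):.2f} W")
```

Under these assumptions a fully idle network still draws roughly two thirds of its peak power,
which is why low utilization is a concern for photonic links.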

The latency will depend on how far apart the sockets are placed, but if the off-chip fiber is under
11cm, the latency will be under 1ns (2-3 cycles for our target clock of 2.5GHz). This latency is
actually quite good when it is put in context with other steps in memory operations such as L2
cache access latencies or DRAM access latencies. For our target system, the serialization latency
will be 16 cycles for a 64B cache line, so in 18-19 cycles a cache line could move from a memory
controller to a cluster's cache.
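
The 16 cycle figure can be reproduced with a short calculation, which also shows what
concentration buys. The sketch assumes each core is provisioned with 8 bytes per cycle of memory
bandwidth (about 5 TBps across 256 cores at 2.5GHz), that a cluster's bandwidth is split evenly
over one link per memory controller, and that the system has 16 memory controllers (one per 16
cores). The 8 bytes per cycle value and the function names are assumptions introduced here.

```python
import math


def link_bytes_per_cycle(core_bytes_per_cycle: float, cores_per_cluster: int,
                         num_memory_controllers: int) -> float:
    """Width of one cluster-to-memory-controller link, in bytes per cycle."""
    return core_bytes_per_cycle * cores_per_cluster / num_memory_controllers


def serialization_cycles(line_bytes: int, bytes_per_cycle: float) -> int:
    """Cycles needed to push one cache line through a link of the given width."""
    return math.ceil(line_bytes / bytes_per_cycle)


# 8-core clusters (concentration) versus one link per core (no concentration).
concentrated = serialization_cycles(64, link_bytes_per_cycle(8, 8, 16))
unconcentrated = serialization_cycles(64, link_bytes_per_cycle(8, 1, 16))
print("with 8-core clusters:", concentrated, "cycles")
print("without concentration:", unconcentrated, "cycles")
```

Under these assumptions, grouping eight cores per cluster makes each link eight times wider and
cuts the serialization latency of a 64B line by the same factor.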

The area was grossly overestimated to give generous gaps between photonic components and the
transistors around them. The area in Table 3.2 is per die, and in our target system each is
100 mm^2, so that is only 4.2% overhead.

Integrating a new technology will have its costs, and they will have to be justified by dramatic
performance improvements. Fortunately, the photonic network presented here will make some other
parts of the system cheaper or easier to design. For example, since all inter-socket communication
will be carried over fibers, this will dramatically reduce the number of traces that need to be
routed on the printed circuit boards (PCB). This will make the PCB easier to design, cheaper to
manufacture, and it will leave more space for other signals. Routing all inter-socket data through
fibers will also mean that there will be fewer pins coming out of the socket, allowing for a smaller
and cheaper package to be used. The increase in delay or energy for an increase in fiber length is
marginal, which will give the system designer more flexibility in where they place sockets. In
summary, using photonics simplifies much of the rest of the system, which will hopefully lessen the
cost of adopting a new technology.

Chapter 4

Die  Size  Exploration 


The design presented in Section 3 can be generalized to handle greater numbers of cores or even
different die sizes. Since all connectivity is off-chip and we leverage the distance insensitivity
of photonics, there is less motivation to integrate and an economic incentive to disintegrate.

4.1  Incentives  for  Disintegration 

Disintegration  may  be  able  to  reduce  the  cost  of  the  system  (relative  to  another  made  with  the 
same  template).  Smaller  dies  could  reduce  costs  by: 

•  Increased yield. Figure 4.1 shows the relative costs of manufacturing 400 mm^2 of silicon as
one whole die or many smaller dies. Although the combined cost of the smaller dies is always
lower due to increased yield, most of the gain can be had by splitting the die four ways
to get a 3x cost advantage. Figure 4.1 is from a simple model [12] that only takes into
account parameters for area and defect densities. In the real world there will also be fixed
costs (packaging, assembly, and test) per die that will make the systems with the smallest
dies less desirable, but there still will be significant advantage to using multiple moderately
smaller dies rather than a single large die.

•  Better binning. Since the dies are smaller they can be binned on a finer granularity to reduce
the impact of process variation. Within a small die, the probability of there being high process
variance is reduced, allowing a greater number of high performance dies to be sold.

•  Greater design reuse. As mentioned previously, being able to use the same die in systems of
different sizes allows for greater amortization of non-recurring engineering (NRE) costs over
the increased manufacturing volume. Smaller dies are easier to reuse because they support a
greater variety of system sizes.
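
The yield argument in the first bullet can be illustrated with a minimal sketch. It uses a
Poisson yield model, which depends only on die area and defect density; this is a common
textbook model and is an assumption here, not necessarily the model of [12]. The defect density
is a hypothetical value, and fixed per-die costs (packaging, assembly, test) are ignored.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-die_area_mm2 * defects_per_mm2)


def relative_silicon_cost(total_area_mm2: float, num_dies: int,
                          defects_per_mm2: float) -> float:
    """Silicon that must be manufactured to obtain total_area_mm2 of good dies."""
    die_area = total_area_mm2 / num_dies
    return total_area_mm2 / poisson_yield(die_area, defects_per_mm2)


DEFECTS_PER_MM2 = 0.004  # hypothetical defect density (0.4 per cm^2)

monolithic = relative_silicon_cost(400.0, 1, DEFECTS_PER_MM2)
for n in (1, 2, 4, 8, 16):
    cost = relative_silicon_cost(400.0, n, DEFECTS_PER_MM2)
    print(f"{n:2d} dies: {monolithic / cost:.2f}x cheaper than one 400 mm^2 die")
```

In this model the advantage grows quickly for the first few splits and then flattens, since the
yield of an already small die is close to one.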

Smaller dies could also make system design easier. With smaller dies, possibly spread farther
apart, the board-level power density of the system is reduced, making cooling easier. It will be
even easier to interface to adjacent electrically connected DRAM with smaller dies, since there will
be less memory surrounding each die.

Figure 4.1: Relative total costs for 400 mm^2 of total silicon area


4.2  Scaling  the  Design 

The design presented in Section 3 can scale to some other sizes, but in this section we describe two
further photonic structures that increase the feasible range of designs.

4.2.1  Mixer 

DWDM allows multiple logical links to share the same waveguide, but when a link needs to cross
to another waveguide a mixer can be used. For each waveguide on one side, all of its wavelengths
are evenly and disjointly distributed across the waveguides on the other side. It is a bidirectional
component, and Figure 4.2 shows a simplified case, where two wavelengths from one waveguide are
separated onto two waveguides. It is possible to extend this design to handle multiple waveguides
per input group, so an NxN mixer (k wide) mixes N groups of k waveguides each. With this abstract
notation, a wide range of components can be classified as mixers, and many of these special cases
have already appeared in various other photonic designs [6, 24, 28].
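
The mixing behavior can be sketched for the 1-wide case. The rotation rule used below, sending
wavelength w of input waveguide i to output waveguide (i + w) mod N, is an assumption chosen
because it satisfies the stated property (even, disjoint distribution, with no wavelength
appearing twice on one output); the actual device may assign wavelengths differently.

```python
def mixer_output(input_wg: int, wavelength: int, n: int) -> int:
    """Output waveguide for one wavelength of one input (assumed rotation rule)."""
    return (input_wg + wavelength) % n


def mix(n: int, wavelengths_per_wg: int) -> dict[int, list[tuple[int, int]]]:
    """Return, per output waveguide, the (input waveguide, wavelength) pairs it carries."""
    outputs: dict[int, list[tuple[int, int]]] = {out: [] for out in range(n)}
    for wg in range(n):
        for wl in range(wavelengths_per_wg):
            outputs[mixer_output(wg, wl, n)].append((wg, wl))
    return outputs


# 2x2 mixer (1 wide) with two wavelengths per waveguide, as in the simplified figure.
for out, carried in mix(2, 2).items():
    print("output", out, "carries", carried)
```

Each output ends up with one wavelength from every input, and because inputs are rotated by
different amounts, two inputs never place the same wavelength on the same output.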


Figure  4.2:  Simplified  2x2  mixer  (1  wide).  Only  one  waveguide’s  wavelengths  shown  for  simplicity. 


To take the system from Section 3.4 from 256 cores total to 1024 cores total (still 64 cores per
socket) will require two 2x2 mixers (8-wide) placed where the inter-socket ribbons attach to the
on-chip waveguides (Figure 4.3). To reach 1024 cores with 64 core dies (100 mm^2), there are 16
sockets so each inter-socket link has only 1 fiber. Scaling in this manner keeps the bandwidth per
core constant, but it does come from a greater number of memory controllers. The input groups to
the mixers correspond to the groups of waveguides on the die. Without striping, the mixers would
have to have more input groups.


Figure 4.3: Die layout for 64 core die designed to support a 1024 core system


4.2.2  Add-Drop  Multiplexer 

When there are more dies in the system than waveguides on a die (this often happens with small
dies), an add-drop multiplexer (ADM) can be used to fan out the wavelengths of one waveguide
onto multiple underloaded waveguides. This component is bidirectional, so from one direction it
looks like a splitter but from the other it looks like an aggregator. As shown in Figure 4.4 this
can be done without crossings. Alternatively the die layout could simply under-fill the waveguides
on-chip, but this wastes area and the optical loss through the ADM is low.

4.3  Evaluation 

Using the generalized design template, we explore a range of possible systems with maximum
capacities of 64-1024 cores built from 4-64 dies. We keep the cluster size the same (8 cores), the
ratio of memory controllers to cores the same (1:16), and the same core density (0.64 cores/mm^2).
Table 4.1 shows what additional components (mixers or ADMs) are required to build systems of
various sizes. There are tradeoffs when designing the base building block (die) for the system,
both in terms of how big it is and how many other blocks it expects. If the maximum system
size is designed too small, it will not be able to scale to larger systems without penalties, but if
it is designed too large, the functionality needed for larger systems will waste area and raise cost
when used in smaller systems. Some places where this tradeoff becomes apparent are: off-chip
bandwidth, off-chip link organization, and coherency. For our particular family of designs, how
populated the system is does not noticeably affect performance once the die size and the maximum
system size have been set.

Figure 4.4: Simplified Add-Drop Multiplexer with two wavelengths per waveguide
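
The fixed ratios make each point in the design space easy to derive from two numbers, the cores
per die and the cores per system. The following sketch only restates the constants given above;
the function and key names are hypothetical.

```python
CORES_PER_CLUSTER = 8
CORES_PER_MEMORY_CONTROLLER = 16
CORE_DENSITY_PER_MM2 = 0.64


def design_point(cores_per_die: int, cores_per_system: int) -> dict[str, float]:
    """Derive die count and per-die resources for one configuration."""
    if cores_per_system % cores_per_die != 0:
        raise ValueError("system size must be a multiple of the die size")
    return {
        "dies": cores_per_system // cores_per_die,
        "clusters_per_die": cores_per_die // CORES_PER_CLUSTER,
        "memory_controllers_per_die": cores_per_die // CORES_PER_MEMORY_CONTROLLER,
        "die_area_mm2": cores_per_die / CORE_DENSITY_PER_MM2,
    }


for cores_per_die in (16, 32, 64):
    print(cores_per_die, "cores/die:", design_point(cores_per_die, 1024))
```

For example, a 1024 core system built from 64 core dies needs 16 dies of about 100 mm^2 each,
while the same system built from 16 core dies needs 64 dies of about 25 mm^2.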


Table 4.1: Additional component requirements (mixers and ADMs) per die to scale the system size.
Where a cell lists both, the fanout degree for the ADM comes first and the mixer degree second.
A dash marks a cell with no entry.

             cores/system
cores/die    64    128    256             512                 1024
16           -     2x     4x              8x                  16x
32           -     -      2x2 (4 wide)    2x, 2x2 (4 wide)    4x, 2x2 (4 wide)
64           -     -      -               -                   2x2 (8 wide)
128          -     -      -               -                   -
256          -     -      -               -                   -

4.3.1  Power 

Since we keep the bandwidth per core constant, the encoding/decoding power remains constant at
24mW per core, whether we scale the number of cores or the number of dies to implement them
(Figure 4.5). Since some of the higher core count designs use additional rings for filters in the
interconnect (in ADMs and mixers), they will have slightly higher thermal tuning power but it is
still negligible. These additional components will have a much larger impact through increased loss
on the optical critical path, which will increase the light generation power significantly.

Figure 4.5: Total power per core

Figure 4.6: Laser power per core (x-axis: configuration; top: cores/die, bottom: cores/system)


Figure 4.6 shows the laser power required for fully populated systems as a function of die size
and maximum system size. Systems that are not fully populated require the same laser power per
core except for when only 1 or 2 sockets are populated and the star coupler is not needed. For
all die sizes, as the maximum supported system size is increased, the required optical power is
also increased, as expected. The rate at which it increases can fluctuate significantly because as
the system size increases, some components (mixers, ADMs, star fiber couplers) are added to the
interconnect and the loss rates of these components vary. A more interesting trend is that smaller
systems are more efficiently constructed from smaller dies, as is visible on the pareto-optimal
curve (underside of the graph). This appears to indicate that systems with a moderate number of
sockets perform best because of the fan-out costs associated with making the all-to-all
connectivity. With our selected technology, smaller dies have an advantage of shorter waveguides
(less loss) as shown by the line for 16 cores per die.

4.3.2  Latency 

Surprisingly, latency does not get much worse when breaking sockets apart into smaller ones, even
if electrical links are used off chip. As visible back in Table 2.2, both technologies get faster
off chip after a minimum distance has been traversed to make up for the conversion delay. Once the
overhead of getting onto a fiber is paid, the signal can travel 8cm in a clock cycle of our baseline
system, so within less than a few cycles, everything is reachable by everything else on board. The
only time link latency is worrisome is when trying to route a signal for a long distance
electrically with a normal repeated wire on-chip, but this does not happen in our design since all
long links are done photonically. Even with systems larger than the one in Section 3.1, the latency
will not get much bigger. With the largest conceivable board layouts, the link latency will still be
less than 2ns, which will be dwarfed by the serialization latency.

4.3.3  Area 

In general our photonic interconnect fits well within an area budget as shown in Figure 4.7. These
are for die designs that are capable of supporting up to a maximum of 1024 cores in the system.
Since our technology is using projected values, these overheads could change, but we are pessimistic
in our assumptions about sizing, which results in over-estimates for area. Smaller dies use less
area for the interconnect, because more of it is off chip and they are small enough that it is still
possible to put many or all of the waveguides over the same air trenches. Although this suggests
less wasted area is another reason smaller dies will be more cost-effective, the most important
result is that using smaller dies is no worse than using larger ones, with respect to area overhead.

Figure 4.7: Percentage die area taken by photonic network (not including switch area)


4.3.4  Discussion 

The main lesson is that there is a tradeoff between integration and disintegration. The models may
not be able to fully capture all of the penalties of having many small dies, but dies smaller than
the ones currently used may be feasible for building systems with a moderate number of sockets.

Historically systems have been built out of dies as large as is reasonable to manufacture because
of the interconnect penalties of traversing socket boundaries. This sometimes results in paying a
significant premium to fabricate larger monolithic dies. The photonic network template presented
could reduce the barrier to multi-socket designs, enabling a new system design methodology of
picking a die size that is cheapest to manufacture, and then using as many dies as needed to build
the desired system.

Chapter 5


Related  Work 


The work of Batten et al. [4] identified the potential for monolithic silicon photonics for making
an interconnection network to connect DRAM to processing cores. We use their technology
assumptions and baseline machine as a starting point for our work. Our work differs in that it adds
the contributions of using multi-socket systems as a way to reduce cost and considering coherence
much more closely. Much of the other related work focuses on the on-chip network for a single
chip and does not consider anything off-chip [7, 13, 16, 22, 24].

Kirman et al. presented a photonic on-chip interconnect in [16]. Their architecture attempted
to utilize each interconnection topology for the range of distances it was best at. They subdivided
a CMP into four blocks and those four blocks were connected by a photonic ring topology. Within
a block electrical interconnects were used at a distance where they were advantageous to optical.
Our network topologies were influenced by this, but we have made a more ambitious design that
uses a more optimistic photonic technology.

Shacham et al. present a photonic NoC for a multiprocessor system that uses photonic switches
built from crossings and resonant rings [24]. To set up a link, an electric control signal must
travel ahead in parallel to the path to set up the switches. This enables them to get higher
bandwidth utilization on their links than a point to point system like what was presented in this
paper, but at the cost of path set up latency and the possibility of network contention. As a
consequence of the set up requirements, they get the best performance from lightly contested bulk
transfers.

Joshi et al. present a low-diameter photonic Clos network and compare it to electrical
alternatives [13]. Their low-diameter network is motivated by the same desire as this work to
provide uniform bandwidth while taking advantage of the distance insensitivity of photonic links.
Unlike this work, their Clos network is able to utilize path diversity, but this would be harder to
implement for the multi-socket case because there are more endpoints to connect.

Phastlane [7] intends to bring the benefits of photonics to a dimension ordered mesh network.
Since light propagates quickly, they allow a packet to sometimes travel multiple hops in a single
cycle. Unlike [24], they set up each hop with an optical control signal that travels in parallel to
the data payload. When there is contention, the packet will travel fewer hops in a cycle and is
stored in an electrical buffer. If the buffer is full, the packet is dropped, and the sender is
notified. Of the photonic proposals, Phastlane is the only one to consider not providing reliable
transmission at the link layer.

Firefly [22] presents a hierarchical NoC. Similar to [16], it subdivides the chip into clusters, and
within a cluster it uses electrical links and between clusters it uses photonic links. The photonic
links use a crossbar, but to prevent the need for global arbitration they break it up into multiple
logical crossbars.

Proximity interconnect [9] is an interesting technology that is trying to solve many of the same
problems our photonic socket-level interconnect is. It places dies very close together and uses
capacitive coupling to transmit data without actual wire contacts. By doing so, it is able to obtain
pitches and bandwidths comparable to on-chip wires. They have aspirations similar to ours for its
use, whether it be making small dies to reduce cost or combining large dies to approach wafer scale
integration. Photonics, especially with DWDM, should be able to achieve even higher bandwidths
and is a somewhat more robust technology, since the exact relative alignment of two dies does not
matter as much.

Three dimensional die stacking is another technology with the same motivation, but it could
be used in conjunction with a photonic interconnect like in Corona [25]. They place their photonic
network on its own die to give them more area and let them use better photonic materials, which
allows them to build more complicated networks. They use a large serpentine crossbar which
has orders of magnitude more components than our networks and would be infeasible with our
monolithically integrated photonics technology. As such, they burn significantly more laser power
than our design for comparable bandwidth, but it is hard to accurately make this comparison since
they are using a different photonic technology.

Chapter 6

Conclusion 


In this work, we present design techniques that produce a general network template that can be
scaled to handle varying numbers of cores and sockets. Scaling our network design to even larger
core counts will probably require moving to a multi-hop network.

Chip disintegration may seem counterintuitive for performance reasons, but with our photonic
network, the performance degradation is made small enough that the cost incentives outweigh
it. This could allow for a rethinking of the design process in which systems are built out of the
appropriate number of the most economically sized dies.

Given the current state of silicon photonic research, multi-socket memory interconnects are a
well-suited application. In the near term, photonics provides clear advantages over electrical
signaling at the on-board/off-chip scale. To optimize these multi-socket systems, photonics should
be used to communicate directly with DRAM, which will remove the last bit of wasteful off-chip
electrical signaling. Further advances, such as efficient integrated lasers, will enable photonics
research to continue to decrease the scale at which optical communication is advantageous,
possibly opening up the chip micro-architecture as the next interesting application.

Appendix A

Coherence  Considerations 


To make this system more realizable it will need a coherency scheme (protocol and hardware
implementation) to turn the network into a memory interconnect, which is something past designs
have not given much consideration to. Especially for the general architecture presented in this
paper, it is essential that the coherency scheme achieve the same goals of reusability and scalability.
We want the same design to be able to handle different binary amounts of populated sockets in
the system without unreasonable overhead. Our system uses shared memory, and coherency is
maintained amongst all caches by a two-level protocol corresponding to within and between clusters.

A.1 Intra-Cluster Coherence

Within a cluster, each core has its own private L1 cache and they all communicate through a shared
L2 cache. The L2 cache is not inclusive of the L1s, but it does store duplicates of the tags. We
envision  using  this  with  a  protocol  similar  to  what  was  described  in  Piranha  [2].  This  protocol 
will  be  responsible  for  keeping  the  caches  within  each  cluster  coherent,  and  requests  that  it  cannot 
handle  will  be  passed  up  to  the  next  level  coherency  protocol. 

A.2 Inter-Cluster Coherence

To maintain coherence between clusters we use a 4-hop MESI directory protocol. From the point
of view of the directory, all caches in a cluster are lumped together and treated as one. We
position a directory by every memory controller so it can intercept requests to memory and take
the appropriate protocol actions. A directory is only responsible for the memory locations its
associated  memory  controller  provides.  The  protocol  uses  4  hops  because  there  is  no  core  to  core 
network,  so  all  inter-core  traffic  must  be  routed  through  the  memory  controller. 
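
The hop count can be illustrated with a minimal sketch. It assumes that every message passes through the home directory and that a miss which no other cluster owns is answered in two hops; the function name, the two-hop case, and the endpoint labels are hypothetical and not taken from the design itself.

```python
def miss_path(requester, home_directory, owner=None):
    """Return the (source, destination) hops for one inter-cluster miss.

    Assumption: clusters have no direct link to one another, so every
    message is relayed by the home directory at the memory controller.
    """
    if owner is None or owner == requester:
        # Assumed simple case: memory answers, so request and reply only.
        return [(requester, home_directory), (home_directory, requester)]
    # Another cluster owns the line, so the directory relays in both directions.
    return [
        (requester, home_directory),  # 1. request to the home directory
        (home_directory, owner),      # 2. request forwarded to the owner
        (owner, home_directory),      # 3. owner responds to the directory
        (home_directory, requester),  # 4. directory replies to the requester
    ]


hops = miss_path("cluster0", "dir3", owner="cluster5")
print(len(hops))  # 4
```

With a core-to-core network, the owner could reply to the requester directly and the third and fourth hops would collapse into one.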

To make the directory small enough to fit on chip rather than in off-chip DRAM, we use a reverse
tagged directory implemented with a Content Addressable Memory (CAM). For every cache line
it is responsible for, the directory contains a duplicate of the cache tag and a few bits of protocol
state. We reduce the associativity required for the directory by implementing it with many small
CAMs, each of which corresponds to a cache set. When a request is being looked up, only the
CAM corresponding to the request’s set needs to be examined. A cache tag’s location in the reverse
directory implicitly identifies the location of its owner. Because all the caches in the system are set
associative, this puts a limit on the number of possible cache lines that could hold a block, namely
Nk if the system has N clusters and each one is k-way set associative. If this associativity is still
too high, multiple CAM arrays could be used, which will still be faster and cheaper than going to
a direct-mapped directory implemented in off-chip DRAM.
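
A small software model may make the reverse tagged organization more concrete. This is only a sketch: the class and method names are hypothetical, the loop stands in for the parallel compare a hardware CAM would perform, and for simplicity the model keeps a CAM for every set rather than only the sets a particular directory is responsible for.

```python
class ReverseTaggedDirectory:
    """Toy model: one small CAM per cache set, with N*k slots each."""

    def __init__(self, num_clusters, ways, num_sets):
        self.num_clusters = num_clusters
        self.ways = ways
        # cams[set_index][slot] holds (tag, state) or None when empty.
        self.cams = [[None] * (num_clusters * ways) for _ in range(num_sets)]

    def _slot(self, cluster, way):
        # The slot position encodes the owner, so no owner field is stored.
        return cluster * self.ways + way

    def install(self, set_index, cluster, way, tag, state):
        self.cams[set_index][self._slot(cluster, way)] = (tag, state)

    def evict(self, set_index, cluster, way):
        self.cams[set_index][self._slot(cluster, way)] = None

    def lookup(self, set_index, tag):
        """Return [(cluster, way, state)] for every cached copy of the block."""
        holders = []
        # Only the CAM for the request's set is searched (at most N*k slots).
        for slot, entry in enumerate(self.cams[set_index]):
            if entry is not None and entry[0] == tag:
                cluster, way = divmod(slot, self.ways)
                holders.append((cluster, way, entry[1]))
        return holders


directory = ReverseTaggedDirectory(num_clusters=4, ways=8, num_sets=16)
directory.install(set_index=3, cluster=2, way=5, tag=0xABC, state="S")
directory.install(set_index=3, cluster=0, way=1, tag=0xABC, state="S")
print(directory.lookup(3, 0xABC))  # [(0, 1, 'S'), (2, 5, 'S')]
```

The holder of each matching tag is recovered from its slot position alone, which is the sense in which a tag's location implicitly identifies its owner.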

Although photonics provides great bandwidth, which might tempt one to snoop, the energy cost
at the endpoints of doing associative lookups for every message at every cluster in the system would
be prohibitive, especially as the system scales. With snooping, for a given protocol miss (like a
write miss), rather than searching the state of one cluster and the home directory, every cluster
would need to be searched. Snooping would also require a broadcast mechanism, which our current
network topology does not provide. It would be possible to design one, but our topology was designed
to minimally meet our goals and our coherency protocol works well without it. The bandwidth savings
a directory protocol provides will also help the system scale to higher core counts and conserve energy.

A.3 Reusability

To support a variable number of populated sockets, the way memory addresses are interleaved can
be leveraged. For a given die size, if the number of populated sockets is doubled, the number of
cache lines doubles; however, the number of sets per cache that can address a particular memory
controller is halved, so the number of possible locations a directory needs to be concerned with
stays the same. The only thing that changes is the implicit mapping of clusters to tags in the
reverse directory.
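
The invariance can be checked with a short calculation. The sketch below assumes that each socket contributes its own memory controllers and that address interleaving spreads every cache's sets evenly across all controllers; the function name and the parameter values (four clusters and four controllers per socket, 8192 sets per cache) are illustrative assumptions, not figures from the design.

```python
def directory_slots(sockets, clusters_per_socket, controllers_per_socket,
                    sets_per_cache, ways):
    """Tag slots one directory must track, under the assumptions above."""
    clusters = sockets * clusters_per_socket
    controllers = sockets * controllers_per_socket
    # Interleaving: each controller is addressed by an equal share of the sets.
    sets_per_controller = sets_per_cache // controllers
    return clusters * sets_per_controller * ways


for sockets in (1, 2, 4):
    # Doubling the sockets doubles the clusters but halves the sets per controller.
    print(sockets, directory_slots(sockets, 4, 4, 8192, 8))
# 1 65536
# 2 65536
# 4 65536
```

Under these assumptions the per-directory slot count does not depend on the number of populated sockets, so the same directory hardware can be reused across configurations.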


A.4 Directory Implementation Feasibility

To demonstrate the feasibility of such a technique, we present a rough model of what reverse tagged
directories would cost by scaling the results of [5] down to 22 nm. To stress our design, we target
the maximum size system our network targeted: 1024 cores over 1600 mm^2 of silicon. The target
system uses a 48-bit physical address. Each cluster has 4 MB of L2 cache that is 8-way set associative.

To implement the CAMs efficiently, we use a pre-computation based CAM [19] with a Half-NOR
cell size of 0.34 µm^2 and a NAND cell size of 0.3695 µm^2. For the CAM arrays alone, this
would take 50.531 mm^2, so rounding up generously for extra decode and control logic, this could
be implemented in 80 mm^2, which is only 5% of the total silicon area.

The  power  required  is  harder  to  estimate  due  to  its  dependence  on  workload  and  coherence 
traffic. In 45 nm [5] each search took 0.14^, so including decode overheads and pessimistic energy
scaling 0.1^ might be possible in 22 nm. Assuming the wildly high coherence activity rate of
each core accessing the directory once every five instructions, the total comes to 0.786 W. This
amount  will  almost  surely  be  drowned  out  by  static  power  of  the  SRAMs  included  to  hold  the 
CAMs’  state.  The  dynamic  search  power  makes  up  such  a  small  portion  of  the  directory’s  power 
because  the  cache  set  partitioning  makes  the  relative  activity  factor  of  any  CAM  cell  quite  low. 

The latency of the directory itself should be quite tolerable. Even without much speed improvement
from process technology and accounting for controller overhead, it should be possible
to get a search done in under a nanosecond [5]. This should clearly win by more than an order of
magnitude compared to off-chip DRAM. Overall we believe we could make an effective coherence
mechanism utilizing reverse tagged directories built from on-chip CAMs.